Abstract

We consider the Gel'fand-Pinsker problem in which the channel and state are general, i.e., possibly non- 
stationary, non-memoryless and non-ergodic. Using the information spectrum method and a non-trivial modification 
of the piggyback coding lemma by Wyner, we prove that the capacity can be expressed as an optimization over the

difference of a spectral inf- and a spectral sup-mutual information rate. We consider various specializations including 
the case where the channel and state are memoryless but not necessarily stationary. We then extend our result to 
obtain the capacity region of the general Gel'fand-Pinsker problem with rate-limited state information available at 
the decoder. As a by-product of our analyses, we also obtain an achievable second-order coding rate for the canonical 
Gel'fand-Pinsker problem. We show that there is a tradeoff between the packing and covering rates. This tradeoff 
is governed by a free parameter which can be optimized to obtain the best achievable bound on the second-order 
coding rate. In fact, we also provide achievable rates at finite blocklength. We show that for the "memory with 
stuck-at faults" binary symmetric channel with crossover probability 0.11 and fault probability 0.1, a blocklength 
of 3034 is sufficient to achieve 90% of capacity assuming an allowable decoding error probability of 1%. 
I. Introduction

In this paper, we consider the classical problem of channel coding with noncausal state information at the encoder,
also known as the Gel'fand-Pinsker problem [1]. In this problem, we would like to send a uniformly distributed
message over a state-dependent channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$, where $\mathcal{S}$, $\mathcal{X}$ and $\mathcal{Y}$ are the state, input and output
alphabets respectively. The random state sequence $S^n \sim P_{S^n}$ is available noncausally at the encoder but not at the
decoder. See Fig. 1. The Gel'fand-Pinsker problem consists in finding the maximum rate for which there exists a
reliable code. Assuming that the channel and state sequence are stationary and memoryless, Gel'fand and Pinsker [1]
showed that this maximum message rate or capacity $C = C(W, P_S)$ is given by

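$$C = \max_{\substack{P_{U|S},\; g : \mathcal{U} \times \mathcal{S} \to \mathcal{X} \\ |\mathcal{U}| \le |\mathcal{X}||\mathcal{S}| + 1}} I(U;Y) - I(U;S). \tag{1}$$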

The coding scheme involves a covering step at the encoder to reduce the uncertainty due to the random state
sequence and a packing step to decode the message [2, Chapter 7]. Thus, we observe the covering rate $I(U; S)$ and
the packing rate $I(U; Y)$ in (1). A weak converse can be proved by using the Csiszár-sum-identity [2, Chapter 7]. A
strong converse was proved by Tyagi and Narayan [3] using entropy and image-size characterizations [4, Chapter 15].
The Gel'fand-Pinsker problem has numerous applications, particularly in information hiding and watermarking [5].
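To make the packing and covering terms in (1) concrete, the following minimal sketch evaluates $I(U;Y) - I(U;S)$ for one fixed choice of $P_{U|S}$ and $g$ on finite alphabets. It does not perform the maximization in (1). The function names, the dictionary-based representation of distributions, and the toy noiseless channel are illustrative assumptions, not taken from the paper.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def gp_objective(p_s, p_u_given_s, g, w):
    """I(U;Y) - I(U;S) for a fixed test channel (no maximization).

    p_s[s]            : state pmf P_S
    p_u_given_s[s][u] : conditional pmf P_{U|S}
    g(u, s)           : deterministic map to the channel input x
    w[(x, s)][y]      : channel W(y | x, s)
    """
    joint_us, joint_uy = {}, {}
    for s, ps in p_s.items():
        for u, pu in p_u_given_s[s].items():
            joint_us[(u, s)] = joint_us.get((u, s), 0.0) + ps * pu
            for y, py in w[(g(u, s), s)].items():
                joint_uy[(u, y)] = joint_uy.get((u, y), 0.0) + ps * pu * py
    return mutual_information(joint_uy) - mutual_information(joint_us)

# Toy (hypothetical) setting: binary alphabets, noiseless channel Y = X.
p_s = {0: 0.5, 1: 0.5}
noiseless = {(x, s): {x: 1.0} for x in (0, 1) for s in (0, 1)}
uniform_u = {s: {0: 0.5, 1: 0.5} for s in (0, 1)}   # U independent of S
copy_state = {0: {0: 1.0}, 1: {1: 1.0}}             # U = S

rate_independent = gp_objective(p_s, uniform_u, lambda u, s: u, noiseless)
rate_copy = gp_objective(p_s, copy_state, lambda u, s: u, noiseless)
print(rate_independent, rate_copy)
```

In this toy setting, choosing $U$ independent of $S$ incurs no covering cost, whereas choosing $U = S$ spends the whole packing rate on covering the state, so the difference vanishes.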

In this paper, we revisit the Gel'fand-Pinsker problem and extend the analysis in two directions. First, instead
of assuming stationarity and memorylessness on the channel and state sequence, we let the channel $W^n$ be a
general one in the sense of Verdú and Han [6], [7]. That is, $\mathbf{W} = \{W^n\}_{n=1}^{\infty}$ is an arbitrary sequence of stochastic
mappings from $\mathcal{X}^n \times \mathcal{S}^n$ to $\mathcal{Y}^n$. We also model the state distribution as a general one $\mathbf{S} \sim \{P_{S^n} \in \mathcal{P}(\mathcal{S}^n)\}_{n=1}^{\infty}$.
Such generality allows us to understand optimal coding schemes for general systems. We prove an analogue of the
Gel'fand-Pinsker capacity in (1) by using information spectrum analysis [7]. Our result is expressed in terms of
Fig. 1. The Gel'fand-Pinsker problem with rate-limited coded state information available at the decoder [10], [11]. The message $M$ is
uniformly distributed in $[1 : \exp(nR)]$ and independent of $S^n$. The compression index $M_d$ is rate-limited and takes values in $[1 : \exp(nR_d)]$.
The canonical Gel'fand-Pinsker problem [1] is a special case in which the output of the state encoder is a deterministic quantity.



the limit superior and limit inferior in probability operations [7]. For the direct part, we leverage a technique
used by Iwata and Muramatsu [8] for the general Wyner-Ziv problem [2, Chapter 11]. Our proof technique involves
a non-trivial modification of Wyner's piggyback coding lemma (PBL) [9, Lemma 4.3]. We also find the capacity
region for the case where rate-limited coded state information is available at the decoder. This setting, shown in
Fig. 1, was studied by Steinberg [10] but we consider the scenario in which the state and channel are general.

Second, a particularly useful application of the information spectrum technique is that it allows us to easily prove
an achievable second-order coding rate [12], [13] for the canonical Gel'fand-Pinsker problem where the channel and
state are stationary and memoryless. In this paper, we seek a finer characterization of achievable transmission rates
for block codes of length $n$ with an allowable average error probability $\epsilon > 0$. This we do by using a newly-developed
nonasymptotic upper bound on the error probability of a Gel'fand-Pinsker code.



A. Main Contributions 

There are three main contributions in this work. See Fig. 2 for the dependencies between our results and existing
ones. 

First, by developing a non-asymptotic upper bound on the average error probability for any Gel'fand-Pinsker problem
(Lemma 9), we prove that the general capacity is

$$C = \sup\, \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}), \tag{2}$$

where the supremum is over all conditional probability laws $\{P_{X^n,U^n|S^n}\}_{n=1}^{\infty}$ and $\underline{I}(\mathbf{U};\mathbf{Y})$ (resp. $\overline{I}(\mathbf{U};\mathbf{S})$) is the
spectral inf-mutual information rate (resp. the spectral sup-mutual information rate) [7]. The expression in (2)
reflects the fact that there are two distinct steps: a covering step and a packing step. To cover successfully, we need
to expend a rate of $\overline{I}(\mathbf{U};\mathbf{S}) + \gamma$ (for any $\gamma > 0$) as stipulated by general fixed-length rate-distortion theory [14,
Section VI]. Thus, each subcodebook has to have at least $\approx \exp(n(\overline{I}(\mathbf{U};\mathbf{S}) + \gamma))$ sequences. To decode the
codeword's subcodebook index correctly, we can have at most $\approx \exp(n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma))$ codewords by the general
channel coding result of Verdú and Han [6]. We can specialize the general result in (2) to the following scenarios:
(i) no state information, thus recovering the result by Verdú and Han [6], (ii) common state information is available
at the encoder and the decoder, (iii) the channel and state are memoryless (but not necessarily stationary) and (iv)
mixed channels and states [7].

Second, we extend the above result to the case where coded state information is available at the decoder. This
problem was first studied by Heegard and El Gamal [11] and later by Steinberg [10]. In this case, we combine
our coding scheme with that of Iwata and Muramatsu for the general Wyner-Ziv problem [8] to obtain the tradeoff
between $R_d$, the rate of the compressed state information that is available at the decoder, and $R$, the message
rate. We show that the tradeoff (or capacity region) is the set of rate pairs $(R, R_d)$ satisfying

$$R_d \ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}) \tag{3}$$
$$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}), \tag{4}$$

for some $(\mathbf{U}, \mathbf{V}) - (\mathbf{X}, \mathbf{S}) - \mathbf{Y}$. This general result can be specialized to the stationary, memoryless setting [10].
Fig. 2. Summary of our contributions and dependencies between different results in the paper. Italicized results are existing ones. Main
results are highlighted in bold. 



Finally, we demonstrate an achievable second-order coding rate$^1$ (which may be negative) for the Gel'fand-Pinsker
problem where the state and channel are stationary and memoryless. By applying appropriate Gaussian approximations
to the bound we derived on the error probability for the general Gel'fand-Pinsker problem (Lemma 9), we show
that there exists a length-$n$ block code with average error probability no larger than $\epsilon$ and with message rate



$$\frac{1}{n} \log M_n \ge C + \frac{R_2}{\sqrt{n}} + O\!\left(\frac{\log n}{n}\right), \tag{5}$$

where $C$ is the Gel'fand-Pinsker capacity in (1) and the second-order coding rate satisfies



R 2 > max y/V{U; Y)®- 1 (\ 2 e 2 ) + y/V(U; 5)$ _1 ((1 - 2A)e), (6) 
Ae[o,i] 

and where the random variables $(U, X, S, Y)$ are induced by the optimal $P_{U|S}$ and $g$ in (1) and $\Phi^{-1}$ is the inverse of
the cumulative distribution function of a standard Gaussian. In (6), $V(U; Y)$ and $V(U; S)$ are the so-called packing
and covering dispersions QjO . Roughly speaking, — ^| + O(-^lp) is the backoff from capacity C at blocklength 

n and a non-zero error probability e > 0. We bound the O(^p) term and in doing so, are able to numerically 
provide achievable finite blocklength rates for the Gel'fand-Pinsker problem. 



B. Related Work 

The study of general channels started with the seminal work by Verdú and Han [6] in which the authors
characterized the capacity in terms of the limit inferior in probability of a sequence of information densities.
See Han's book [7] for a comprehensive exposition on the information spectrum method. This line of analysis
provides deep insights into the fundamental limits of the transmission of information over general channels and
the compressibility of general sources that may not be stationary, memoryless or ergodic. Information spectrum
analysis has been used for rate-distortion [14], the Wyner-Ziv problem [8], the Heegard-Berger problem [17], the
Wyner-Ahlswede-Körner (WAK) problem [18] and the wiretap channel [19], [20]. The Wyner-Ziv and wiretap
problems are the most closely related to the problem we solve in this paper. In particular, they involve differences
of mutual informations akin to the Gel'fand-Pinsker problem. Even though it is not in the spirit of information
spectrum methods, the work of Yu et al. [21] deals with the Gel'fand-Pinsker problem for non-stationary, non-ergodic
Gaussian noise and state (also called "writing on colored paper"). We contrast our work to [21] in Section III-E.

Our initial motivation for this work was to derive the second-order coding rate for the Gel'fand-Pinsker channel.
This line of research was initiated by Strassen [22] who showed for discrete memoryless channels that if $M_n(\epsilon)$

1 The way we define the second-order coding rate is similar to Hayashi [13] and thus we would like to maximize it.
designates the maximum message set size of $(n, \epsilon)$-codes,

$$\left| \log M_n(\epsilon) - \left[ nC - \sqrt{nV}\, Q^{-1}(\epsilon) \right] \right| \le O(\log n), \tag{7}$$

where $C$ is the channel capacity and $V$ is the variance of the log-likelihood ratio of the channel and the optimal output
distribution. This line of analysis was revisited by Kontoyiannis for lossless source coding [23] and by Hayashi for
intrinsic randomness [12] and channel coding [13] (for more general channel models such as the additive Markovian
and the AWGN channels). Recently, Polyanskiy-Poor-Verdú [16] tightened the $O(\log n)$ term in (7) to obtain finite
blocklength results. The same authors also analyzed the finite blocklength limits of the Gilbert-Elliott channel [24],
which is a state-dependent channel where the binary state evolves according to a Markov chain.
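As a rough numerical illustration of the leading terms in (7), the sketch below evaluates $nC - \sqrt{nV}\,Q^{-1}(\epsilon)$ for a plain binary symmetric channel without state. This is an assumption made purely for illustration: it is not the Gel'fand-Pinsker bound of this paper, the $O(\log n)$ term is ignored, and the function names are hypothetical. The closed forms used for $C$ and $V$ are the standard ones for the binary symmetric channel, in nats.

```python
import math
from statistics import NormalDist

def bsc_capacity_and_dispersion(p):
    """Capacity C (nats) and dispersion V (nats^2) of a BSC with crossover p."""
    h = -p * math.log(p) - (1 - p) * math.log(1 - p)      # binary entropy
    c = math.log(2) - h
    v = p * (1 - p) * math.log((1 - p) / p) ** 2          # variance of the information density
    return c, v

def normal_approx_log_m(n, eps, c, v):
    """Leading terms n*C - sqrt(n*V)*Q^{-1}(eps) of (7); the O(log n) term is dropped."""
    q_inv = -NormalDist().inv_cdf(eps)                    # Q^{-1}(eps) = -Phi^{-1}(eps)
    return n * c - math.sqrt(n * v) * q_inv

c, v = bsc_capacity_and_dispersion(0.11)
for n in (100, 1000, 10000):
    rate = normal_approx_log_m(n, 0.01, c, v) / n
    print(n, rate / c)    # fraction of capacity suggested by the approximation
```

The printed ratio approaches one as $n$ grows, which is the sense in which the $\sqrt{n}$ term quantifies the backoff from capacity at finite blocklength.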

To the best of the author's knowledge, the only other works on finite blocklength (or second-order coding)
analysis for channels with state are the following: First, Ingber and Feder [25] considered the problem of channels
with random states where the state $S^n$ is available at the decoder and not the encoder. In that case, the problem
can be regarded as a channel coding problem where the output is $(Y, S)$ instead of $Y$ and so the dispersion can
be derived using the law of total variance. Second, Polyanskiy and Verdú [26] derived the dispersion of the scalar
coherent fading channel. Third, Hoydis et al. [27] considered finite blocklength analysis for the MIMO fading
channel. Fourth, Vituri and Feder [28] obtained the dispersion of infinite constellations in fast fading channels.
Finally, Yang et al. [29] analyzed the finite blocklength performance of codes operating over Rayleigh block-fading
channels in the noncoherent setting. In all these works, the channel state or fading coefficients are available at the
decoder whereas in our work the state information is available at the encoder only. Recently, Verdú generalized the
packing and covering lemmata [2] to obtain finite blocklength bounds for a variety of network information theory
problems [30] (including Gel'fand-Pinsker) and these bounds may produce better finite length performance than
the technique used in this paper.

It is also worth mentioning that bounds on the reliability function (error exponent) for the Gel'fand-Pinsker
problem have been derived by Tyagi and Narayan [3] (upper bounds) and Moulin and Wang [31] (lower bounds).
However, these bounds involve several nested maximizations and minimizations (over distributions) and are hard to
evaluate compared to the finite blocklength bounds in this paper.



C. Paper Organization 

The rest of this paper is structured as follows. In Section II, we state our notation and various other definitions.
In Section III, we state all information spectrum-based results and their specializations. In Section IV, we state our
achievability result on the second-order coding rates. In Section V, we provide numerical results illustrating the
second-order coding rates and finite blocklength bounds. We conclude our discussion in Section VI. The proofs of
all results are provided in the Appendices.

II. System Model and Main Definitions 
In this section, we state our notation and the definitions of the various problems that we consider in this paper. 



A. Notation 

Random variables (e.g., $X$) and their realizations (e.g., $x$) are denoted by upper case and lower case serif font
respectively. Sets are denoted in calligraphic font (e.g., the alphabet of $X$ is $\mathcal{X}$). We use the notation $X^n$ to mean
a vector of random variables $(X_1, \ldots, X_n)$. In addition, $\mathbf{X} = \{X^n\}_{n=1}^{\infty}$ denotes a general source in the sense that
each member of the sequence $X^n = (X_1^{(n)}, \ldots, X_n^{(n)})$ is a random vector. The same holds for a general channel
$\mathbf{W} = \{W^n : \mathcal{X}^n \to \mathcal{Y}^n\}_{n=1}^{\infty}$, which is simply a sequence of stochastic mappings from $\mathcal{X}^n$ to $\mathcal{Y}^n$. The set of all
probability distributions with support on an alphabet $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. We use the notation $X \sim P_X$ to mean
that the distribution of $X$ is $P_X$. The joint distribution induced by the marginal $P_X$ and the conditional $P_{Y|X}$ is
denoted as Px°Py\x- Information-theoretic quantities are denoted using the usual notations JH, (e.g., I(X;Y) 
for mutual information or $I(P_X, P_{Y|X})$ if we want to make the dependence on distributions explicit). All logarithms
are to the base $e$. We also use the discrete interval notation $[i : j] := \{i, \ldots, j\}$ and, for the most part, ignore
integer requirements. The notation $\Phi(t) := \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\, du$ (resp. $Q(t) := 1 - \Phi(t)$) refers to the cumulative
(resp. complementary cumulative) distribution function of the standard Gaussian. Correspondingly, $\Phi^{-1}(\cdot)$ (resp.
$Q^{-1}(\cdot) = -\Phi^{-1}(\cdot)$) is its inverse.

We recall the following probabilistic limit operations [7].

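Definition 1. Let $\mathbf{U} := \{U_n\}_{n=1}^{\infty}$ be a sequence of real-valued random variables. The limsup in probability of $\mathbf{U}$
is an extended real number defined as

$$\text{p-}\limsup_{n \to \infty} U_n := \inf\left\{ \alpha : \lim_{n \to \infty} \mathbb{P}(U_n > \alpha) = 0 \right\}. \tag{8}$$

The liminf in probability of $\mathbf{U}$ is defined as

$$\text{p-}\liminf_{n \to \infty} U_n := -\,\text{p-}\limsup_{n \to \infty}\, (-U_n). \tag{9}$$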

We also recall the following definitions from Han [7]. These definitions play a prominent role in the following.

Definition 2. Given a pair of stochastic processes $(\mathbf{X}, \mathbf{Y}) = \{X^n, Y^n\}_{n=1}^{\infty}$ with joint distributions $\{P_{X^n,Y^n}\}_{n=1}^{\infty}$,
the spectral sup-mutual information rate is defined as

$$\overline{I}(\mathbf{X};\mathbf{Y}) := \text{p-}\limsup_{n \to \infty} \frac{1}{n} \log \frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)}. \tag{10}$$

The spectral inf-mutual information rate $\underline{I}(\mathbf{X};\mathbf{Y})$ is defined as in (10) with p-liminf in place of p-limsup. The
spectral sup- and inf-conditional mutual information rates are defined similarly.



B. The Gel'fand-Pinsker Problem

We now recall the definition of the Gel'fand-Pinsker problem. The setup is shown in Fig. 1 with $M_d = \emptyset$.

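Definition 3. An $(n, M_n, \epsilon)$ code for the Gel'fand-Pinsker problem with channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$ and state
distribution $P_{S^n} \in \mathcal{P}(\mathcal{S}^n)$ consists of

• An encoder $f_n : [1 : M_n] \times \mathcal{S}^n \to \mathcal{X}^n$ (possibly randomized)

• A decoder $\varphi_n : \mathcal{Y}^n \to [1 : M_n]$

such that the average error probability in decoding the message does not exceed $\epsilon$, i.e.,

$$\frac{1}{M_n} \sum_{s^n \in \mathcal{S}^n} P_{S^n}(s^n) \sum_{m=1}^{M_n} W^n\big(\mathcal{B}_m^c \,\big|\, f_n(m, s^n), s^n\big) \le \epsilon, \tag{11}$$

where $\mathcal{B}_m := \{y^n \in \mathcal{Y}^n : \varphi_n(y^n) = m\}$ and $\mathcal{B}_m^c := \mathcal{Y}^n \setminus \mathcal{B}_m$.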

We assume that the message is uniformly distributed in the message set $[1 : M_n]$ and that it is independent of
the state sequence $S^n \sim P_{S^n}$. Here we remark that in (11) (and everywhere else in the paper), we use the notation
$\sum_{s^n \in \mathcal{S}^n}$ even though $\mathcal{S}^n$ may not be countable. This is the convention we adopt throughout the paper even though
integrating against the measure $P_{S^n}$, as in $\int_{\mathcal{S}^n} \cdot \; dP_{S^n}$, would be more precise.

Definition 4. We say that the nonnegative number $R$ is an $\epsilon$-achievable rate if there exists a sequence of $(n, M_n, \epsilon_n)$
codes for which

$$\liminf_{n \to \infty} \frac{1}{n} \log M_n \ge R, \qquad \limsup_{n \to \infty} \epsilon_n \le \epsilon. \tag{12}$$

The $\epsilon$-capacity $C_\epsilon$ is the supremum of all $\epsilon$-achievable rates. The capacity is $C := \lim_{\epsilon \to 0} C_\epsilon$.

The capacity of the general Gel'fand-Pinsker channel is stated in Section III-A. This generalizes the result in (1),
which is the capacity for the Gel'fand-Pinsker channel when the channel and state are stationary and memoryless.
C. The Gel'fand-Pinsker Problem With Coded State Information at the Decoder

In fact, the general information spectrum techniques allow us to solve a related problem which was first considered
by Heegard and El Gamal [11] and subsequently solved by Steinberg [10]. The setup is shown in Fig. 1.

Definition 5. An $(n, M_n, M_{d,n}, \epsilon)$ code for the Gel'fand-Pinsker problem with channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$ and
state distribution $P_{S^n} \in \mathcal{P}(\mathcal{S}^n)$ and with coded state information at the decoder consists of

• A state encoder: $f_{d,n} : \mathcal{S}^n \to [1 : M_{d,n}]$

• An encoder: $f_n : [1 : M_n] \times \mathcal{S}^n \to \mathcal{X}^n$ (possibly randomized)

• A decoder: $\varphi_n : \mathcal{Y}^n \times [1 : M_{d,n}] \to [1 : M_n]$

such that the average error probability in decoding the message is no larger than $\epsilon$, i.e.,

$$\frac{1}{M_n} \sum_{s^n \in \mathcal{S}^n} P_{S^n}(s^n) \sum_{m=1}^{M_n} W^n\big(\mathcal{B}_{m,s^n}^c \,\big|\, f_n(m, s^n), s^n\big) \le \epsilon, \tag{13}$$

where $\mathcal{B}_{m,s^n} := \{y^n \in \mathcal{Y}^n : \varphi_n(y^n, f_{d,n}(s^n)) = m\}$.

Definition 6. We say that the pair of nonnegative numbers $(R, R_d)$ is an achievable rate pair if there exists a
sequence of $(n, M_n, M_{d,n}, \epsilon_n)$ codes such that

$$\liminf_{n \to \infty} \frac{1}{n} \log M_n \ge R, \qquad \limsup_{n \to \infty} \frac{1}{n} \log M_{d,n} \le R_d, \qquad \lim_{n \to \infty} \epsilon_n = 0. \tag{14}$$

The set of all achievable rate pairs is known as the capacity region $\mathscr{C}$.

Heegard and El Gamal [11] (achievability) and Steinberg [10] (converse) showed for the discrete memoryless
channel and discrete memoryless state that the capacity region $\mathscr{C}$ is the set of rate pairs $(R, R_d)$ such that

$$R_d \ge I(V;S) - I(V;Y) \tag{15}$$
$$R \le I(U;Y|V) - I(U;S|V) \tag{16}$$

for some Markov chain $(U, V) - (X, S) - Y$. Furthermore, $X$ can be taken to be a deterministic function of $(U, V, S)$,
$|\mathcal{V}| \le |\mathcal{X}||\mathcal{S}| + 1$ and $|\mathcal{U}| \le |\mathcal{X}||\mathcal{S}|(|\mathcal{X}||\mathcal{S}| + 1)$. The constraint in (15) is obtained using Wyner-Ziv coding with
"source" $S$ and "side-information" $Y$. The constraint in (16) is analogous to the Gel'fand-Pinsker capacity where $V$
is common to both encoder and decoder. A weak converse was proven using repeated applications of the Csiszár-sum-identity.
We generalize Steinberg's region for the general source $\mathbf{S}$ and general channel $\mathbf{W}$ using information
spectrum techniques in Section III-F.

III. Information Spectrum Characterization of the General Gel'fand-Pinsker Problem

In this section, we first present the main result concerning the capacity of the general Gel'fand-Pinsker problem
(Definition 3) in Section III-A. These results are derived using the information spectrum method. We then derive
the capacity for various special cases of the Gel'fand-Pinsker problem in Section III-B (two-sided common state
information) and Section III-C (memoryless channels and state). We consider mixed states and mixed channels in
Section III-D. The main ideas in the proof are discussed in Section III-E. Finally, in Section III-F, we extend our
result to the general Gel'fand-Pinsker problem with coded state information at the decoder (Definition 5).

A. Main Result and Remarks 

We now state our main result followed by some simple remarks. The proof can be found in Appendices A and B.

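Theorem 1 (General Gel'fand-Pinsker Capacity). The capacity of the general Gel'fand-Pinsker channel with general
states $(\mathbf{W}, \mathbf{S})$ (see Definition 3) is

$$C = \sup_{\mathbf{U} - (\mathbf{X}, \mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}) \tag{17}$$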
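[Figure 3: information spectra of $i(U^n; S^n)$ and $i(U^n; Y^n)$, with $\underline{I}(\mathbf{U};\mathbf{S})$, $\overline{I}(\mathbf{U};\mathbf{S})$, $\underline{I}(\mathbf{U};\mathbf{Y})$ and $\overline{I}(\mathbf{U};\mathbf{Y})$ marked on the axis.]

Fig. 3. Illustration of Theorem 1, where $i(U^n; S^n) := n^{-1} \log[P_{U^n|S^n}(U^n|S^n) / P_{U^n}(U^n)]$ and similarly for $i(U^n; Y^n)$. The capacity is
the difference between $\underline{I}(\mathbf{U};\mathbf{Y})$ and $\overline{I}(\mathbf{U};\mathbf{S})$ evaluated at the optimal processes. The stationary, memoryless case (Corollary 4) corresponds
to the situation in which $\underline{I}(\mathbf{U};\mathbf{S}) = \overline{I}(\mathbf{U};\mathbf{S}) = I(U;S)$ and $\underline{I}(\mathbf{U};\mathbf{Y}) = \overline{I}(\mathbf{U};\mathbf{Y}) = I(U;Y)$, so the information spectra are point masses
at the mutual informations.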



where the maximization is over all sequences of random variables $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y}) = \{U^n, X^n, S^n, Y^n\}_{n=1}^{\infty}$ forming
the requisite Markov chain$^2$, having the state distribution coinciding with $\mathbf{S}$ and having conditional distribution of
$\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$.

See Fig. 3 for an illustration of Theorem 1.

Remark 1. The general formula in (17) is the dual of that in the Wyner-Ziv case [8]. However, the proofs,
and in particular, the constructions of the codebooks, the notions of typicality and the application of Wyner's 
PBL, are subtly different from JH, QjQ , E3- We discuss these issues in Section UlI-EI Another problem which 
involves a difference of mutual information quantities is the wiretap channel [2, Chapter 22]. General formulas for
the secrecy capacity using channel resolvability theory [7, Chapter 5] were provided by Hayashi [19] and Bloch
and Laneman [20]. They also involve the difference between spectral inf-mutual information rate (of the input and 
the legitimate receiver) and sup-mutual information rate (of the input and the eavesdropper). 

Remark 2. Unlike the usual Gel'fand-Pinsker formula for stationary and memoryless channels and states in (1), we
cannot conclude that the conditional distribution we are optimizing over, $P_{X^n,U^n|S^n}$ in (17), can be decomposed into
the conditional distribution of $U^n$ given $S^n$ (i.e., $P_{U^n|S^n}$) and a deterministic function (i.e., $\mathbb{1}\{x^n = g_n(u^n, s^n)\}$).

Remark 3. If there is no state, i.e., $\mathcal{S} = \emptyset$ in (17), then we recover the general formula for channel capacity by
Verdú and Han (VH)

$$C_{\mathrm{VH}} = \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}). \tag{18}$$

The achievability follows by setting $\mathbf{U} = \mathbf{X}$. The converse follows by noting that $\underline{I}(\mathbf{U};\mathbf{Y}) \le \underline{I}(\mathbf{X};\mathbf{Y})$ because
$\mathbf{U} - \mathbf{X} - \mathbf{Y}$ [6, Theorem 9]. This is the analogue of the data processing inequality for the spectral inf-mutual
information rate.

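Remark 4. The general formula in (17) can be slightly generalized to the Cover-Chiang (CC) setting [33] in which
(i) the channel $W^n : \mathcal{X}^n \times \mathcal{S}_e^n \times \mathcal{S}_d^n \to \mathcal{Y}^n$ depends on two state sequences $(S_e^n, S_d^n) \sim P_{S_e^n, S_d^n}$ (in addition to
$X^n$), (ii) partial channel state information $S_e^n$ is available noncausally at the encoder and (iii) partial channel state
information $S_d^n$ is available at the decoder. In this case, replacing $\mathbf{Y}$ with $(\mathbf{Y}, \mathbf{S}_d)$ and $\mathbf{S}$ with $\mathbf{S}_e$ in (17) yields

$$C_{\mathrm{CC}} = \sup_{\mathbf{U} - (\mathbf{X}, \mathbf{S}_e, \mathbf{S}_d) - (\mathbf{Y}, \mathbf{S}_d)} \underline{I}(\mathbf{U};\mathbf{Y}, \mathbf{S}_d) - \overline{I}(\mathbf{U};\mathbf{S}_e), \tag{19}$$

where the supremum is over all processes $(\mathbf{U}, \mathbf{X}, \mathbf{S}_e, \mathbf{S}_d, \mathbf{Y})$ such that $(\mathbf{S}_e, \mathbf{S}_d)$ coincides with the state distributions
$\{P_{S_e^n, S_d^n}\}_{n=1}^{\infty}$ and $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S}_e, \mathbf{S}_d)$ coincides with the sequence of channels $\{W^n\}_{n=1}^{\infty}$. Hence the optimization
in (19) is over the conditionals $\{P_{X^n,U^n|S_e^n}\}_{n=1}^{\infty}$.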

B. Two-sided Common State Information 

Specializing (19) to the case where $\mathbf{S}_e = \mathbf{S}_d = \mathbf{S}$, i.e., the same side information is available to both encoder
and decoder (ED), does not appear to be straightforward without further assumptions. Recall that in the usual
scenario [33, Case 4 in Corollary 1], we use the identification $U = X$ and the chain rule for mutual information to
assert that $I(X; Y, S) - I(X; S) = I(X; Y|S)$ evaluated at the optimal $P_{X|S}$ is the capacity. However, in information
spectrum analysis, the chain rule does not hold for the p-liminf operation. In fact, p-liminf is superadditive [7].
Nevertheless, under the assumption that a sequence of information densities converges in probability, we can derive
the capacity of the general channel with general common state available at both terminals using Theorem 1.

2 For three processes $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \{X^n, Y^n, Z^n\}_{n=1}^{\infty}$, we say that $\mathbf{X} - \mathbf{Y} - \mathbf{Z}$ forms a Markov chain if $X^n - Y^n - Z^n$ for all $n \in \mathbb{N}$.

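Corollary 2 (General Channel Capacity with State at ED). Consider the problem

$$C_{\mathrm{ED}} = \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}|\mathbf{S}), \tag{20}$$

where the supremum is over all $(\mathbf{X}, \mathbf{S}, \mathbf{Y})$ such that $\mathbf{S}$ coincides with the given state distributions $\{P_{S^n}\}_{n=1}^{\infty}$ and
$\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ coincides with the given channels $\{W^n\}_{n=1}^{\infty}$. Assume that the maximizer of (20) exists and denote
it by $\{P^*_{X^n|S^n}\}_{n=1}^{\infty}$. Let the distribution of $\mathbf{X}^*$ given $\mathbf{S}$ be $\{P^*_{X^n|S^n}\}_{n=1}^{\infty}$. If

$$\underline{I}(\mathbf{X}^*;\mathbf{S}) = \overline{I}(\mathbf{X}^*;\mathbf{S}), \tag{21}$$

then the capacity of the state-dependent channel with state $\mathbf{S}$ available at both encoder and decoder is $C_{\mathrm{ED}}$ in (20).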

The proof is provided in Appendix C. If the joint process $(\mathbf{X}^*, \mathbf{S})$ satisfies (21) (and $\mathcal{X}$ and $\mathcal{S}$ are finite sets), it is
called information stable [34]. In other words, the limit distribution of $n^{-1} \log[P_{U^n|S^n}(U^n|S^n) / P_{U^n}(U^n)]$ (where
$P_{U^n|S^n} = P^*_{X^n|S^n}$) in Fig. 3 concentrates at a single point. We remark that a different achievability proof technique
(that does not use Theorem 1) would allow us to dispense with the information stability assumption. We can simply
develop a conditional version of Feinstein's lemma [7, Lemma 3.4.1] to prove the direct part of (20). However, we
choose to start from Theorem 1, which is the most general capacity result for the Gel'fand-Pinsker problem. Note
that the converse of Corollary 2 does not require (21).



C. Memoryless Channels and Memoryless States 

To see how we can use Theorem 1 concretely, we specialize it to the memoryless (but not necessarily
stationary) setting and we provide some interesting examples. In the memoryless setting, the sequence of channels
$\mathbf{W} = \{W^n\}_{n=1}^{\infty}$ and the sequence of state distributions $P_{\mathbf{S}} = \{P_{S^n}\}_{n=1}^{\infty}$ are such that for every $(x^n, y^n, s^n) \in
\mathcal{X}^n \times \mathcal{Y}^n \times \mathcal{S}^n$, we have $W^n(y^n|x^n, s^n) = \prod_{i=1}^{n} W_i(y_i|x_i, s_i)$ and $P_{S^n}(s^n) = \prod_{i=1}^{n} P_{S_i}(s_i)$ for some $\{W_i :
\mathcal{X} \times \mathcal{S} \to \mathcal{Y}\}_{i=1}^{\infty}$ and some $\{P_{S_i} \in \mathcal{P}(\mathcal{S})\}_{i=1}^{\infty}$.

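Corollary 3 (Memoryless Gel'fand-Pinsker Channel Capacity). Assume that $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{S}$ are finite sets and the
Gel'fand-Pinsker channel is memoryless and characterized by $\{W_i\}_{i=1}^{\infty}$ and $\{P_{S_i}\}_{i=1}^{\infty}$. Define $\phi(P_{X,U|S}; W, P_S) :=
I(U; Y) - I(U; S)$. Let the maximizers to the optimization problems indexed by $i \in \mathbb{N}$

$$C(W_i, P_{S_i}) = \max_{\substack{P_{X,U|S} \\ |\mathcal{U}| \le |\mathcal{X}||\mathcal{S}| + 1}} \phi(P_{X,U|S}; W_i, P_{S_i}) \tag{22}$$

be denoted as $P^*_{X_i,U_i|S_i} : \mathcal{S} \to \mathcal{X} \times \mathcal{U}$. Let $P^*_{S_i,X_i,U_i,Y_i} = P_{S_i} \circ P^*_{X_i,U_i|S_i} \circ W_i \in \mathcal{P}(\mathcal{S} \times \mathcal{X} \times \mathcal{U} \times \mathcal{Y})$ be the joint
distribution induced by $P^*_{X_i,U_i|S_i}$. Assume that either of the two limits

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(P_{S_i}, P^*_{U_i|S_i}), \qquad \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(P^*_{U_i}, P^*_{Y_i|U_i}) \tag{23}$$

exist. Then the capacity of the memoryless Gel'fand-Pinsker channel is

$$C_{\mathrm{M'less}} = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} C(W_i, P_{S_i}). \tag{24}$$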

The proof of Corollary 3 is detailed in Appendix D. The Cesàro summability assumption in (23) is only required
for achievability. We illustrate the assumption in (23) with two examples in the sequel. The proof of the direct
part of Corollary 3 follows by taking the optimization in the general result (17) to be over memoryless conditional
distributions. The converse follows by repeated applications of the Csiszár-sum-identity [2, Chapter 2]. If in addition
to being memoryless, the channels and states are stationary (i.e., each $W_i$ and each $P_{S_i}$ is equal to $W$ and $P_S$
respectively), then both limits in (23) exist since $P^*_{X_i,U_i|S_i}$ is the same for each $i \in \mathbb{N}$.
Corollary 4 (Stationary, Memoryless Gel'fand-Pinsker Channel Capacity). Assume that $\mathcal{S}$ is a finite set. In the
stationary, memoryless case, the capacity of the Gel'fand-Pinsker channel is given by $C(W, P_S)$ in (1).

We omit the proof because it is a straightforward consequence of Corollary 3. Close examination of the proof of
Corollary 3 shows that only the converse of Corollary 4 requires the assumption that $|\mathcal{S}| < \infty$. The achievability
of Corollary 4 follows easily from Khintchine's law of large numbers [7, Lemma 1.3.2] (for abstract alphabets).

To gain a better understanding of the assumption in (23) in Corollary 3, we now present a couple of (pathological)
examples which are inspired by [7, Remark 3.2.3].

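Example 1. Let $\mathcal{J} := \{i \in \mathbb{N} : 2^{2k-1} \le i < 2^{2k}, k \in \mathbb{N}\} = [2 : 3] \cup [8 : 15] \cup [32 : 63] \cup \ldots$. Consider a discrete,
nonstationary, memoryless channel $\mathbf{W}$ satisfying

$$W_i = \begin{cases} W_a & i \in \mathcal{J} \\ W_b & i \in \mathcal{J}^c \end{cases}, \tag{25}$$

where $W_a, W_b : \mathcal{X} \times \mathcal{S} \to \mathcal{Y}$ are two distinct channels. Let $P_{S^n} = Q^n$ be the $n$-fold extension of some $Q \in \mathcal{P}(\mathcal{S})$.
Let $V_m^* : \mathcal{S} \to \mathcal{U}$ be the $\mathcal{U}$-marginal of the maximizer of (22) when the channel is $W_m$, $m \in \{a, b\}$. In general,
$I(Q, V_a^*) \ne I(Q, V_b^*)$. Because $\liminf_{n \to \infty} \frac{1}{n} |\mathcal{J} \cap [1 : n]| = \frac{1}{3} \ne \limsup_{n \to \infty} \frac{1}{n} |\mathcal{J} \cap [1 : n]| = \frac{2}{3}$, the first limit in
(23) does not exist. Similarly, the second limit does not exist in general and Corollary 3 cannot be applied.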
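The oscillation of the running density of the index set in Example 1 can be checked directly. The sketch below (function names are illustrative, not from the paper) computes $\frac{1}{n}|\mathcal{J} \cap [1:n]|$ exactly at the end of each block of $\mathcal{J}$ and just before the next block begins. It uses the fact that $2^{2k-1} \le i < 2^{2k}$ holds for some $k \ge 1$ exactly when $i$ has an even number of binary digits.

```python
from fractions import Fraction

def in_J(i):
    """True if 2^(2k-1) <= i < 2^(2k) for some integer k >= 1."""
    return i >= 2 and i.bit_length() % 2 == 0

def density(n):
    """|J ∩ [1:n]| / n as an exact fraction."""
    return Fraction(sum(in_J(i) for i in range(1, n + 1)), n)

for k in range(1, 6):
    end_of_block = 2 ** (2 * k) - 1        # last index of the k-th block of J
    before_next = 2 ** (2 * k + 1) - 1     # last index before the (k+1)-th block
    print(k, density(end_of_block), float(density(before_next)))
```

The density equals 2/3 at the end of every block and approaches 1/3 just before the next block, so the Cesàro averages of any two distinct per-letter quantities assigned to $\mathcal{J}$ and $\mathcal{J}^c$ cannot converge.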



Example 2. Let $\mathcal{J}$ be as in Example 1 and let the set of even and odd positive integers be $\mathcal{E}$ and $\mathcal{O}$ respectively.
Let $\mathcal{S} = \mathcal{X} = \mathcal{Y} = \{0, 1\}$. Consider a binary, nonstationary, memoryless channel $\mathbf{W}$ satisfying

$$W_i = \begin{cases} W_a & i \in \mathcal{O} \cap \mathcal{J} \\ W_b & i \in \mathcal{O} \cap \mathcal{J}^c \\ W_c & i \in \mathcal{E} \end{cases}, \tag{26}$$

where $W_a, W_b, W_c : \mathcal{X} \times \mathcal{S} \to \mathcal{Y}$. Also consider a binary, nonstationary, memoryless state $\mathbf{S}$ satisfying

$$P_{S_i} = \begin{cases} Q_a & i \in \mathcal{O} \\ Q_b & i \in \mathcal{E} \end{cases}, \tag{27}$$

where $Q_a, Q_b \in \mathcal{P}(\{0, 1\})$. In addition, assume that $W_m(\,\cdot\,|\,\cdot\,, s)$ for $(m, s) \in \{a, b\} \times \{0, 1\}$ are binary symmetric
channels with arbitrary crossover probabilities $q_{ms} \in (0, 1)$. Let $V^*_{m,l} : \mathcal{S} \to \mathcal{U}$ be the $\mathcal{U}$-marginal of the maximizer
in (22) when the channel is $W_m$, $m \in \{a, b, c\}$ and the state distribution is $Q_l$, $l \in \{a, b\}$. For $m \in \{a, b\}$ (odd
blocklengths), due to the symmetry of the channels the optimal $V^*_{m,a}(u|s)$ is Bernoulli($\frac{1}{2}$) and independent of $s$ [2,
Problem 7.12(c)]. Thus, for all odd blocklengths, the mutual informations in the first limit in (23) are equal to
zero. Clearly, the first limit in (23) exists, equalling $\frac{1}{2} I(Q_b, V^*_{c,b})$ (contributed by the even blocklengths). Therefore,
Corollary 3 applies and we can show that the Gel'fand-Pinsker capacity is



$$C = \frac{1}{2} \left[ G(Q_a) + C(W_c, Q_b) \right], \tag{28}$$

where $C(W, P_S)$ in (22) is given explicitly in [2, Problem 7.12(c)] and $G : \mathcal{P}(\{0, 1\}) \to \mathbb{R}$ is defined as

$$G(Q) := \frac{2}{3} \min\{C(W_a, Q), C(W_b, Q)\} + \frac{1}{3} \max\{C(W_a, Q), C(W_b, Q)\}. \tag{29}$$

See Appendix E for the derivation of (28). The expression in (28) implies that the capacity consists of two parts:
$C(W_c, Q_b)$ represents the performance of the system $(W_c, Q_b)$ at even blocklengths, while $G(Q_a)$ represents the
non-ergodic behavior of the channel at odd blocklengths with state distribution $Q_a$; cf. [7, Remark 3.2.3]. In
the special case that $C(W_a, Q_a) = C(W_b, Q_a)$ (e.g., $W_a = W_b$), the capacity is the average of $G(Q_a) =
C(W_a, Q_a)$ and $C(W_c, Q_b)$.
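The structure of the capacity in Example 2 can be checked numerically against the liminf of Cesàro averages in (24). The sketch below assumes hypothetical per-letter capacities `c_a`, `c_b`, `c_c` (standing in for $C(W_a, Q_a)$, $C(W_b, Q_a)$ and $C(W_c, Q_b)$) and approximates the liminf by the minimum running average over a long tail of blocklengths. The weights 2/3 and 1/3 in `closed_form` are our reading of (29), chosen to match the limiting densities of $\mathcal{J}$ computed in Example 1; all function names are illustrative.

```python
def in_J(i):
    """True if 2^(2k-1) <= i < 2^(2k) for some integer k >= 1."""
    return i >= 2 and i.bit_length() % 2 == 0

def per_letter_capacity(i, c_a, c_b, c_c):
    """Capacity of the i-th channel use: W_c on even i; on odd i, W_a if i in J, else W_b."""
    if i % 2 == 0:
        return c_c
    return c_a if in_J(i) else c_b

def tail_min_cesaro(c_a, c_b, c_c, n_min=2 ** 10, n_max=2 ** 15):
    """Minimum of the running averages over n in [n_min, n_max], a proxy for the liminf."""
    total, best = 0.0, float("inf")
    for n in range(1, n_max + 1):
        total += per_letter_capacity(n, c_a, c_b, c_c)
        if n >= n_min:
            best = min(best, total / n)
    return best

def closed_form(c_a, c_b, c_c):
    """(1/2) * [G + c_c] with G = (2/3) * min + (1/3) * max, as read from (28) and (29)."""
    g = (2 / 3) * min(c_a, c_b) + (1 / 3) * max(c_a, c_b)
    return 0.5 * (g + c_c)

# Hypothetical capacity values in nats, chosen only for illustration.
print(tail_min_cesaro(1.0, 0.4, 0.7), closed_form(1.0, 0.4, 0.7))
print(tail_min_cesaro(0.4, 1.0, 0.7), closed_form(0.4, 1.0, 0.7))
```

Whichever of the two odd-position channels is worse ends up with weight 2/3, because the liminf selects the blocklengths at which that channel has been used most often.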
D. Mixed Channels and Mixed States

Now we use Theorem 1 to compute the capacity of the Gel'fand-Pinsker channel when the channel and state
sequence are mixed. More precisely, we assume that

$$W^n(y^n|x^n, s^n) = \sum_{k=1}^{\infty} \alpha_k W_k^n(y^n|x^n, s^n), \tag{30}$$

$$P_{S^n}(s^n) = \sum_{l=1}^{\infty} \beta_l P_{S_l^n}(s^n). \tag{31}$$

Note that we require $\sum_{k=1}^{\infty} \alpha_k = \sum_{l=1}^{\infty} \beta_l = 1$. In fact, let $\mathcal{K} := \{k \in \mathbb{N} : \alpha_k > 0\}$ and $\mathcal{L} := \{l \in \mathbb{N} : \beta_l > 0\}$.
Note that if each $S_l^n$ is a stationary and memoryless source, the composite source $S^n$ given by (31) is a canonical
example of a non-ergodic and stationary source. By (30), the channel $W^n$ can be regarded as an average channel
given by the convex combination of $|\mathcal{K}|$ constituent channels. It is stationary but non-ergodic and non-memoryless.
Given $P_{X^n,U^n|S^n}$, define the following random variables which are indexed by $k$ and $l$:

$$(S_l^n, X_l^n, U_l^n, Y_{kl}^n) \sim P_{S_l^n} \circ P_{X^n,U^n|S^n} \circ W_k^n. \tag{32}$$

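Corollary 5 (Mixed Channels and Mixed States). The capacity of the general mixed Gel'fand-Pinsker channel with general mixed state as in (30)-(32) is

$$C = \sup_{\mathbf{U} - (\mathbf{X}, \mathbf{S}) - \mathbf{Y}} \left\{ \inf_{(k,l) \in \mathcal{K} \times \mathcal{L}} \underline{I}(\mathbf{U}_l; \mathbf{Y}_{kl}) - \sup_{l \in \mathcal{L}} \overline{I}(\mathbf{U}_l; \mathbf{S}_l) \right\}, \qquad (33)$$

where the supremum is over all sequences of random variables $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y}) = \{U^n, X^n, S^n, Y^n\}_{n=1}^{\infty}$ with state distribution coinciding with $\mathbf{S}$ in (31) and having conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$ in (30). Furthermore, if each general state sequence $S_l^n$ and each general channel $W_k^n$ is stationary and memoryless, the capacity is lower bounded as

$$C \ge \max_{U - (X, S) - Y} \left\{ \inf_{(k,l) \in \mathcal{K} \times \mathcal{L}} I(U_l; Y_{kl}) - \sup_{l \in \mathcal{L}} I(U_l; S_l) \right\}, \qquad (34)$$

where $(S_l, X_l, U_l, Y_{kl}) \sim P_{S_l} \circ P_{X,U|S} \circ W_k$ and the maximization is over all joint distributions $P_{U,X,S,Y}$ satisfying $P_{U,X,S,Y} = \sum_{k,l} \alpha_k \beta_l P_{S_l} P_{X,U|S} W_k$ for some $P_{X,U|S}$.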

Corollary 5 is proved in Appendix F and it basically applies [7, Lemma 3.3.2] to the mixture with components in (32). Different from existing analyses for mixed channels and sources Q, JH, here there are two independent mixtures: that of the channel and that of the state. Hence, we need to minimize over two indices for the first term in (33). If instead of the countable number of terms in the sums in (30) and (31), the number of mixture components (of either the source or channel) is uncountable, Corollary 5 no longer applies and a corresponding result has to involve the assumptions that the alphabets are finite and the constituent channels are memoryless. See [7, Theorem 3.3.6].

The corollary says that the Gel'fand-Pinsker capacity is governed by two elements: (i) the "worst-case" virtual channel (from $\mathcal{U}^n$ to $\mathcal{Y}^n$), i.e., the one with the smallest packing rate $\underline{I}(\mathbf{U}_l; \mathbf{Y}_{kl})$, and (ii) the "worst-case" state distribution, i.e., the one that results in the largest covering rate $\overline{I}(\mathbf{U}_l; \mathbf{S}_l)$. Unfortunately, obtaining a converse result for the stationary, memoryless case from (33) does not appear to be straightforward. The same issue was also encountered for the mixed wiretap channel [20].
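To make the two "worst-case" terms concrete, the following is a minimal sketch that evaluates the objective inside the braces of (34) for one fixed choice of $P_{U|S}$ and a deterministic map $x = g(u, s)$. It assumes finite alphabets and finitely many mixture components; the function names and the dictionary-based data layout are illustrative assumptions and are not taken from the paper. Mutual informations are in bits.

```python
from math import log2


def mutual_information(joint):
    """I(A;B) in bits for a joint pmf given as a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def mixed_gp_objective(states, channels, p_u_given_s, encoder):
    """min over (k,l) of I(U_l;Y_kl) minus max over l of I(U_l;S_l).

    states:      list of state pmfs, each a dict {s: P_{S_l}(s)}
    channels:    list of channels, each a dict {(x, s): {y: W_k(y|x,s)}}
    p_u_given_s: dict {s: {u: P_{U|S}(u|s)}}  (hypothetical test channel)
    encoder:     function g(u, s) -> x        (hypothetical deterministic map)
    """
    packing, covering = [], []
    for p_s in states:
        joint_us = {(u, s): p_s[s] * p_u_given_s[s][u]
                    for s in p_s for u in p_u_given_s[s]}
        covering.append(mutual_information(joint_us))
        for w in channels:
            joint_uy = {}
            for (u, s), p in joint_us.items():
                x = encoder(u, s)
                for y, w_y in w[(x, s)].items():
                    joint_uy[(u, y)] = joint_uy.get((u, y), 0.0) + p * w_y
            packing.append(mutual_information(joint_uy))
    return min(packing) - max(covering)
```

Maximizing the returned value over $P_{U|S}$ and $g$ would give the right-hand side of (34); the sketch only evaluates a single candidate.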



E. Proof Idea of Theorem 1

1) Direct part: The high-level idea in the achievability proof is similar to the usual Gel'fand-Pinsker coding scheme [1], which involves a covering step to reduce the uncertainty due to the random state sequence and a packing step to decode the transmitted codeword. However, to use the information spectrum method on the general channel and general state, the definitions of "typicality" have to be restated in terms of information densities. See the definitions in Appendix A. The main burden of the proof is to show that the probability that the transmitted codeword $U^n$ is not "typical" with the channel output $Y^n$ vanishes. In regular Gel'fand-Pinsker coding, one appeals to the conditional typicality lemma [1, Lemma 2], [2, Chapter 2] (which holds for "strongly typical sets") to assert that this error probability is small. But the "typical sets" used in information spectrum analysis do not allow us to apply the conditional typicality lemma in a straightforward manner. For example, our decoder is a threshold test involving the information density statistic $n^{-1} \log(P_{Y^n|U^n} / P_{Y^n})$. It is not clear that, in the event that there is no covering error, the transmitted codeword $U^n$ passes the threshold test (i.e., $n^{-1} \log(P_{Y^n|U^n} / P_{Y^n})$ exceeds a certain threshold) with high probability.
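As an illustration of such a threshold test, here is a minimal sketch for a toy special case only: it assumes the virtual channel from $U^n$ to $Y^n$ is a memoryless binary symmetric channel and that $U^n$ is i.i.d. uniform, so that $P_{Y^n}$ is uniform as well. These assumptions, and the function names, are hypothetical and are not part of the general setting treated in the paper.

```python
from math import log2


def normalized_information_density(u, y, crossover):
    """(1/n) log2 [P(y^n|u^n) / P(y^n)] for a memoryless BSC with uniform input."""
    n = len(u)
    total = 0.0
    for ui, yi in zip(u, y):
        w = crossover if ui != yi else 1.0 - crossover
        total += log2(w / 0.5)  # P_Y is uniform when U is uniform
    return total / n


def passes_threshold(u, y, crossover, threshold):
    """Threshold test: accept the codeword u if its information density is large."""
    return normalized_information_density(u, y, crossover) >= threshold
```

In the general (non-memoryless) setting the statistic no longer decomposes into a sum over symbols, which is why the analysis works with the densities directly.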

To get around this problem, we modify Wyner's PBL³ [9, Lemma 4.3], [35, Lemma A.1] accordingly. Wyner essentially derived an analog of the Markov lemma [2, Chapter 12] without strong typicality by introducing a new "typical set" defined in terms of conditional probabilities. This new definition is particularly useful for problems that involve covering and packing as well as having some Markov structure. Our analysis is somewhat similar to the analyses of the general Wyner-Ziv problem in [8] and the WAK problem in [18], [32]. This is unsurprising given that the Wyner-Ziv and Gel'fand-Pinsker problems are duals [33]. However, unlike in [8], we construct random subcodebooks and use them in subsequent steps, rather than asserting the existence of a single codebook via random selection and subsequently regarding it as deterministic. This is because, unlike Wyner-Ziv, we need to construct exponentially many subcodebooks, each of size $\approx \exp(n \overline{I}(\mathbf{U}; \mathbf{S}))$ and each indexing a message in $[1 : M_n]$. We also require each of these subcodebooks to be different and identifiable based on the channel output. Also, our analogue of Wyner's "typical set" is different from previous works.

We also point out that Yu et al. [21] considered the Gaussian Gel'fand-Pinsker problem for non-stationary and non-ergodic channel and state. However, the notion of typicality used is "weak typicality", which means that the sample entropy is close to the entropy rate. This notion does not generalize well for obtaining the capacity expression in Theorem 1, which involves limits in probability of information densities. Furthermore, Gaussianity is a crucial hypothesis in the proof of the asymptotic equipartition property in [21]. The technique in [21] does not seem to be amenable to second-order coding (or finite blocklength) analysis. Because we use the information spectrum method, achievable second-order coding rates are fairly easy to obtain.

2) Converse part: For the converse, we use the Verdú-Han converse [7, Lemma 3.2.2] and the fact that the message is independent of the state sequence. Essentially, we emulate the steps for the converse of the general wiretap channel presented by Bloch and Laneman [20, Lemma 4].

F. Coded State Information at Decoder 

We now state the capacity region of the coded state information problem (Definition 5).

Theorem 6 (Coded State Information at Decoder). The capacity region of the Gel'fand-Pinsker problem with coded state information at the decoder (see Definition 5) is given by the set of pairs $(R, R_d)$ satisfying

$$R \le \underline{I}(\mathbf{U}; \mathbf{Y} | \mathbf{V}) - \overline{I}(\mathbf{U}; \mathbf{S} | \mathbf{V}) \qquad (35)$$
$$R_d \ge \overline{I}(\mathbf{V}; \mathbf{S}) - \underline{I}(\mathbf{V}; \mathbf{Y}) \qquad (36)$$

for $(\mathbf{U}, \mathbf{V}, \mathbf{X}, \mathbf{S}, \mathbf{Y}) = \{U^n, V^n, X^n, S^n, Y^n\}_{n=1}^{\infty}$ satisfying $(\mathbf{U}, \mathbf{V}) - (\mathbf{X}, \mathbf{S}) - \mathbf{Y}$, having the state distribution coinciding with $\mathbf{S}$ and having conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$.

A proof sketch is provided in Appendix G. For the direct part, we combine Wyner-Ziv and Gel'fand-Pinsker coding to obtain the two constraints in Theorem 6. To prove the converse, we exploit the independence of the message and the state, the Verdú-Han lemma [7, Lemma 3.2.2] and the proof technique for the converse of the general rate-distortion problem [7, Section 5.4]. Because the proof of Theorem 6 is very similar to that of Theorem 1, we only provide a sketch. We note that similar ideas can be easily employed to find the general capacity region for the problem of coded state information at the encoder (and full state information at the decoder) [36]. In analogy to Corollary 2, we can use Theorem 6 to recover Steinberg's result [10] for the stationary, memoryless case. See Appendix H for the proof.

3 One version of the piggyback coding lemma (PBL), given in [35, Lemma A.1], can be stated as follows: If $U - V - W$ are random variables forming a Markov chain, $(U^n, V^n, W^n) \sim \prod_{i=1}^{n} P_{U,V}(u_i, v_i) P_{W|V}(w_i | v_i)$, and $\psi_n : \mathcal{U}^n \times \mathcal{W}^n \to [0, 1]$ is a function satisfying $\mathbb{E}\,\psi_n(U^n, W^n) \to 0$, then for any given $\varepsilon > 0$ and all sufficiently large $n$ there exists a mapping $g_n : \mathcal{V}^n \to \mathcal{W}^n$ such that (i) $\frac{1}{n} \log \|g_n\| \le I(V; W) + \varepsilon$ and (ii) $\mathbb{E}\,\psi_n(U^n, g_n(V^n)) \le \varepsilon$. The function $\psi_n(u^n, w^n)$ is usually taken to be the indicator that $(u^n, w^n)$ are not jointly typical.

Corollary 7 (Coded State Information at Decoder for Stationary, Memoryless Channels and States). Assume that $\mathcal{S}$ is a finite set. The capacity of the Gel'fand-Pinsker channel with coded state information at the decoder in the stationary, memoryless case is given in (15) and (16).



IV. An Achievable Second-Order Coding Rate 

In this section, we leverage our analysis for the general Gel'fand-Pinsker channel, and in particular Lemma 9, to derive an achievable second-order coding rate. We assume that the channel and the state are stationary and memoryless in this section, i.e., $W^n(y^n | x^n, s^n) = \prod_{i=1}^{n} W(y_i | x_i, s_i)$ and $P_{S^n}(s^n) = \prod_{i=1}^{n} P_S(s_i)$. The alphabets $\mathcal{S}, \mathcal{X}, \mathcal{Y}$ do not have to be finite sets. We now formalize the notion of the $(n, \varepsilon)$-capacity.

Definition 7. We say that the nonnegative number $R$ is an $(n, \varepsilon)$-achievable rate for the Gel'fand-Pinsker channel $W$ with random state $S$ if there exists an $(n, M_n, \varepsilon)$ code (per Definition 3) where $M_n \ge \exp(nR)$. The $(n, \varepsilon)$-capacity $C(n, \varepsilon)$ is the supremum over all $(n, \varepsilon)$-achievable rates.

In information theory H, Q, the term capacity is defined as an asymptotic quantity so the (n, e)-capacity defined 
above is admittedly an abuse of terminology. Nonetheless, we aim to bound the $(n, \varepsilon)$-capacity of the Gel'fand-Pinsker channel. However, it is more convenient at this point to characterize the second-order coding rate of the Gel'fand-Pinsker problem [12], [13]. To do so, consider the following definition.

Definition 8. We say that the real number $R_2$ (which may be negative) is a second-order $\varepsilon$-achievable rate for the Gel'fand-Pinsker channel $W$ with random state $S$ if there exists a sequence of $(n, M_n, \varepsilon_n)$ codes for which

$$\liminf_{n \to \infty} \frac{1}{\sqrt{n}} (\log M_n - n C_\varepsilon) \ge R_2, \qquad \limsup_{n \to \infty} \varepsilon_n \le \varepsilon, \qquad (37)$$

where $C_\varepsilon$ is the $\varepsilon$-capacity of the Gel'fand-Pinsker channel. The second-order coding rate $R_2(\varepsilon | W, S)$ is the supremum of all second-order $\varepsilon$-achievable rates.

Note from the strong converse⁴ that $C_\varepsilon = C$ for all $\varepsilon \in (0, 1)$. Hence $C_\varepsilon$ in (37) is the Gel'fand-Pinsker capacity in (1). For simplicity, let us assume that the optimum in (1) is unique.⁵ We denote it as $P^*_{X,U|S} = P^*_{U|S} \circ 1\{x = g^*(u, s)\}$, where $P^*_{U|S}$ and $g^*$ optimize (1). The definition of $R_2 = R_2(\varepsilon | W, S)$ in (37) means we are searching over all Gel'fand-Pinsker codes where the number of messages $M_n$ satisfies

$$\log M_n \ge nC + \sqrt{n} R_2 + o(\sqrt{n}). \qquad (38)$$

We seek the largest possible $R_2$ such that the error probability is no larger than $\varepsilon$ in the limit of large blocklengths. Fix a tuple $(U, X, S, Y)$ distributed as $P_S \circ P^*_{U|S} \circ 1\{x = g^*(u, s)\} \circ W$. The packing information dispersion and covering information dispersion⁶ are respectively defined as

$$V(U; Y) := \mathrm{Var}\left[ \log \frac{P_{Y|U}(Y|U)}{P_Y(Y)} \right], \qquad (39)$$
$$V(U; S) := \mathrm{Var}\left[ \log \frac{P_{U|S}(U|S)}{P_U(U)} \right]. \qquad (40)$$

For the moment, we assume that both dispersions are positive and we partially address the issue of zero covering dispersion in Remark 6. Furthermore, define

$$\underline{R}_2(\varepsilon | W, S) := \max_{\lambda \in [0, \frac{1}{2}]} \left\{ \sqrt{V(U; Y)}\, \Phi^{-1}(\lambda^2 \varepsilon^2) + \sqrt{V(U; S)}\, \Phi^{-1}((1 - 2\lambda)\varepsilon) \right\}. \qquad (41)$$

4 The strong converse was proved for discrete memoryless systems [3], but it can be argued that $C_\varepsilon = C$ also holds for arbitrary alphabets subject to the condition that the probability density functions are well-behaved, i.e., continuous and bounded. For if this were not so, then there would exist a discrete memoryless Gel'fand-Pinsker problem with capacity $C$ given in (1), created via the discretization procedure in [2, Chapter 3], such that its $\varepsilon$-capacity $C_\varepsilon > C$. This contradicts the strong converse [3].

5 If the optimum in (1) is not unique, then we can further maximize the expression in (41) over all pairs of conditional distributions and functions $(P^*_{U|S}, g^*)$ that achieve the maximum in (1).

6 Dispersion, a term coined by Polyanskiy-Poor-Verdú [16], has traditionally been defined as an operational quantity. Hence we append the word "information" to "dispersion" to emphasize that (39) and (40) are information-theoretic (and not operational) quantities.

Fig. 4. Memory with stuck-at faults, with $P(S = 0) = P(S = 1) = q/2$. This is a generalization of [2, Example 7.3].
The maximum in (41) indeed exists and is attained in the open interval $(0, \frac{1}{2})$. This is because if $\lambda = 0$ (resp. $\lambda = \frac{1}{2}$), the first (resp. second) term is $-\infty$. Away from these endpoints, the objective function is finite.

Theorem 8 (Lower Bound to Second-Order Coding Rate). Assume that $V(U; Y)$ and $V(U; S)$ are positive and the third moments of the information densities $\log(P_{Y|U}(Y|U) / P_Y(Y))$ and $\log(P_{U|S}(U|S) / P_U(U))$ exist and are finite. If $\varepsilon \in (0, \frac{1}{2})$, the following lower bound holds on $R_2(\varepsilon | W, S)$:

$$R_2(\varepsilon | W, S) \ge \underline{R}_2(\varepsilon | W, S). \qquad (42)$$

This theorem is proved in Appendix I.
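The one-dimensional maximization in (41) is easy to carry out numerically. The following is a minimal sketch using a plain grid search over $\lambda \in (0, \frac{1}{2})$; the function name, the grid resolution and the example dispersion values in the test are illustrative assumptions, not values from the paper.

```python
from math import sqrt
from statistics import NormalDist


def second_order_lower_bound(v_packing, v_covering, eps, grid=2000):
    """Grid search for the maximizer of the objective in (41).

    Returns (lambda_star, value). Endpoints are excluded because the
    objective equals -infinity at lambda = 0 and lambda = 1/2.
    """
    inv = NormalDist().inv_cdf  # standard normal quantile function
    best_lam, best_val = None, float("-inf")
    for i in range(1, grid):
        lam = 0.5 * i / grid
        val = (sqrt(v_packing) * inv((lam * eps) ** 2)
               + sqrt(v_covering) * inv((1.0 - 2.0 * lam) * eps))
        if val > best_val:
            best_lam, best_val = lam, val
    return best_lam, best_val
```

A finer grid or a scalar optimizer could replace the loop; the grid is used here only because the objective is cheap to evaluate and has a single free parameter.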

Remark 5 (Achievable Finite Blocklength Rate). In fact, our proof shows that the $(n, \varepsilon)$-capacity can be lower bounded for all blocklengths $n$ as

$$C(n, \varepsilon) \ge C + \frac{\underline{R}_2(\varepsilon | W, S)}{\sqrt{n}} - \theta_n(\varepsilon), \qquad (43)$$

where $\theta_n(\varepsilon) = O\left(\frac{\log n}{n}\right)$. We can characterize $\theta_n(\varepsilon)$ nonasymptotically using the Berry-Esseen theorem [37] (stated in Lemma 10). Because the expression is complicated (involving implicitly specified constants), we do not present it here. Interested readers may note that the third-order term is given by the sum of the two quantities defined in (157) and (158). Thus, the $o(\sqrt{n})$ term in (38) is in fact $O(\log n)$.

Remark 6. If the optimizing distribution $P^*_{X,U|S}$ in (1) has $\mathcal{U}$-marginal $P^*_{U|S}$ that is independent of $S$ (e.g., the binary symmetric channel with state in [2, Problem 7.12(c)], [11, Example 2]), then the covering dispersion $V(U; S)$ in (40) is zero. Then, by using Shannon strategies lfl31 and Feinstein's lemma [7, Lemma 3.4.1], we see that the second-order coding rate is at least $\sqrt{V(U; Y)}\, \Phi^{-1}(\varepsilon)$ (for the optimizing $P_U$), in line with the works on finite blocklength analysis for channel coding [13], [16], [22].

The interpretation of Theorem 8 is as follows: First, the lower bound on the $(n, \varepsilon)$-capacity in (43) converges to the capacity in (1) at a rate of order $1/\sqrt{n}$, in line with the central limit theorem. Second, there are two distinct contributions to the rate of convergence (the constant multiplying $1/\sqrt{n}$): the covering rate $\sqrt{V(U; S)}\, \Phi^{-1}((1 - 2\lambda)\varepsilon)$ and the packing rate $\sqrt{V(U; Y)}\, \Phi^{-1}(\lambda^2 \varepsilon^2)$. These contributions result from the estimate of the error probability we obtained in the course of proving Theorem 1 using a combination of Wyner's PBL [9, Lemma 4.3] and Feinstein's lemma [7, Lemma 3.4.1] (for packing). Our estimate is presented as Lemma 9 and involves two information spectrum quantities.

V. Numerical Results 

In this section, we calculate the lower bound on the second-order coding rate and $(n, \varepsilon)$-capacity for a generalization of the memory with stuck-at faults example in [2, Example 7.3]. The channel and state are illustrated in Fig. 4. The state $S = 0$ corresponds to a faulty memory cell that outputs 0 independent of the input value, the state $S = 1$ corresponds to a faulty memory cell that outputs a 1, and the state $S = 2$ corresponds to a binary symmetric channel with crossover probability $\delta$. The probabilities of these states are $q/2$, $q/2$ and $1 - q$, respectively. It is known [11] that the capacity with noncausal state information at the encoder (Gel'fand-Pinsker) is

$$C = (1 - q)(\log 2 - h_b(\delta)), \qquad (44)$$

where $h_b(\delta) := -\delta \log \delta - (1 - \delta) \log(1 - \delta)$ is the binary entropy function.
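As a quick numerical check of (44), the sketch below evaluates the capacity with logarithms taken to base 2 (so that $\log 2 = 1$ and the result is in bits per channel use). The function names are illustrative; the parameter values are those used later in this section.

```python
from math import log2


def binary_entropy(p):
    """Binary entropy h_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def stuck_at_capacity(delta, q):
    """Capacity (44) of the memory with stuck-at faults, in bits per channel use."""
    return (1 - q) * (1 - binary_entropy(delta))


print(stuck_at_capacity(0.11, 0.1))  # close to 0.45 bits per channel use
```

For $\delta = 0.11$ the binary entropy is very nearly one half bit, so the capacity is close to $(1 - q)/2$.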

We start by examining the tradeoff between the packing and covering second-order coding rates in (41). We set $\delta = 0.11$, $q = 0.1$ and $\varepsilon = 0.1$ for the memory with stuck-at faults channel. The optimal $P^*_{U|S}$ and $g^*$ can be found analytically [11, Corollary 1] and thus, we can compute the packing rate $\sqrt{V(U; Y)}\, \Phi^{-1}(\lambda^2 \varepsilon^2)$ and the covering rate $\sqrt{V(U; S)}\, \Phi^{-1}((1 - 2\lambda)\varepsilon)$ as functions of $\lambda \in [0, \frac{1}{2}]$. These constituent second-order coding rates, together with their sum, are plotted in Fig. 5. We observe that the packing rate (resp. covering rate) is monotonically increasing (resp. monotonically decreasing) in $\lambda$ due to the monotonicity of the $\Phi^{-1}$ function. The optimum $\lambda^*$ that maximizes the sum is thus attained in the interior of $[0, \frac{1}{2}]$.

Fig. 5. Tradeoff between $\sqrt{V(U; Y)}\, \Phi^{-1}(\lambda^2 \varepsilon^2)$ (packing) and $\sqrt{V(U; S)}\, \Phi^{-1}((1 - 2\lambda)\varepsilon)$ (covering) as functions of $\lambda$ for $\varepsilon = 0.1$. The sum curve is the function to be maximized in the lower bound on the second-order coding rate in (41). The optimal $\lambda$ is the maximizer of the sum curve and occurs at $\lambda^* = 0.402$ in this example.

Fig. 6. Left: Optimal $\lambda^*$ as a function of the fault probability $q$ for $\delta = 0.11$ and error probability $\varepsilon = 0.1$. Right: Gaussian approximation (GA) and finite blocklength (FB) (or $(n, \varepsilon)$-capacity) as a fraction of capacity for blocklengths $n = 10^3, 10^4, 10^5$. Horizontal axis in both panels: fault probability $q = 2 \Pr(S = 1)$.

In Fig. 6, we fix $\delta = 0.11$, $\varepsilon = 0.1$ and vary $q$, the fault probability governing the state distribution. The optimum parameter $\lambda^*$ is plotted as a function of $q$ on the left of Fig. 6. Observe that when $q \to 0$, the tradeoff is biased towards the packing term ($\lambda^* \to 0.5$). This is intuitive since the state distribution becomes degenerate and the Gel'fand-Pinsker problem reduces to channel coding, which only involves packing. We computed the ratios $(C + \underline{R}_2(\varepsilon | W, S)/\sqrt{n})/C$ and $(C + \underline{R}_2(\varepsilon | W, S)/\sqrt{n} - \theta_n(\varepsilon))/C$, where $C$ is the Gel'fand-Pinsker capacity. These ratios are, respectively, the fraction of capacity based on (i) the Gaussian approximation lower bound (first two terms in (43)) and (ii) the finite blocklength or $(n, \varepsilon)$-capacity lower bound taking into account the third-order term $\theta_n(\varepsilon)$ in (43). In our implementation of (ii), we use numerical methods to find $\theta_n(\varepsilon)$ (cf. (167) and (168)) and we also use the best known Berry-Esseen constant $T_{\mathrm{BE}} < 0.4748$ [37]. These ratios are plotted as functions of $q$ for various $n$ in the right panel of Fig. 6. We observe that the backoff from capacity is the smallest when the state distribution is deterministic (i.e., when $q = 0$) because the uncertainty in channel parameters is reduced. In this scenario, the problem reduces to channel coding but the second-order coding rate in (41) is expected to be suboptimal (compared to the works that focus purely on channel coding [13], [16], [22]) because of the different bounding technique we used to account for the random state.

Fig. 7. Gaussian approximation (GA) and $(n, \varepsilon)$-capacity (FB) as functions of the blocklength $n$ for different average error probabilities $\varepsilon$ (panels: $\varepsilon = 0.01$ and $\varepsilon = 0.30$).
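For orientation, the Berry-Esseen constant quoted above enters through the classical bound for sums of i.i.d. random variables: the Kolmogorov distance between the normalized sum and the standard Gaussian is at most $T \rho / (\sigma^3 \sqrt{n})$, where $\sigma^2$ is the variance and $\rho$ the third absolute central moment. The sketch below evaluates this textbook form only; it is an assumption that the paper's Lemma 10 takes this shape, and the function name is illustrative.

```python
from math import sqrt


def berry_esseen_gap(variance, third_abs_moment, n, constant=0.4748):
    """Upper bound on sup_x |F_n(x) - Phi(x)| for a normalized i.i.d. sum.

    variance:         Var of one summand (must be positive)
    third_abs_moment: E|X - E X|^3 of one summand
    constant:         Berry-Esseen constant (0.4748 is the value quoted in the text)
    """
    return constant * third_abs_moment / (variance ** 1.5 * sqrt(n))
```

The bound decays as $1/\sqrt{n}$, which is why its effect on the rate shows up only in the third-order term.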

In Fig. 7, we plot the Gaussian approximation and $(n, \varepsilon)$-capacity as functions of $n$ for fixed $\varepsilon$ for the memory with stuck-at faults channel with $\delta = 0.11$ and $q = 0.1$. We see that a blocklength of $n \approx 3034$ is sufficient to get us to 90% of capacity if we allow an error probability of $\varepsilon = 0.01$. Compared to finite blocklength channel coding [16], this blocklength is considered long for this target error probability. Of course, this is only a sufficient condition (achievability). However, the long blocklength may be attributed to two factors: First, our bounding technique based on Wyner's PBL [9] may be loose at finite blocklengths (Lemma 9), resulting in a loose lower bound for the second-order coding rate. Recently, Verdú [30] proposed new achievability bounds for some network information theory problems that may result in better finite length performance than that demonstrated above. Second, by the very nature of the problem, the random state induces uncertainty in channel behavior, and thus we need a much longer blocklength to get to a fixed percentage of capacity. In Fig. 8, we plot the Gaussian approximation and $(n, \varepsilon)$-capacity as functions of $\varepsilon$ for fixed $n$. The plot shows that as the error probability increases, the backoff from capacity is reduced. Notice from Figs. 6, 7 and 8 that the third-order terms have a non-negligible effect even at moderately long blocklengths of $n \approx 10^3$.
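The relation between blocklength and fraction of capacity under the Gaussian approximation (the first two terms of (43)) can be sketched as follows. This ignores the third-order term $\theta_n(\varepsilon)$, so it does not reproduce the finite blocklength figure of 3034 quoted above; the function names and the value of the second-order rate used in the test are hypothetical.

```python
from math import ceil, sqrt


def ga_fraction(capacity, r2, n):
    """Fraction of capacity (C + r2 / sqrt(n)) / C under the Gaussian approximation."""
    return (capacity + r2 / sqrt(n)) / capacity


def ga_blocklength(capacity, r2, target_fraction):
    """Smallest n with ga_fraction >= target_fraction, for target_fraction < 1."""
    if r2 >= 0:
        return 1  # no backoff from capacity at any blocklength
    # (C + r2/sqrt(n))/C >= f  <=>  sqrt(n) >= r2 / ((f - 1) C)  when r2 < 0, f < 1
    return ceil((r2 / ((target_fraction - 1.0) * capacity)) ** 2)
```

Because the backoff scales as $1/\sqrt{n}$, halving the gap to capacity requires roughly four times the blocklength.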

VI. Conclusions and Further Research 

In this work, we derived the capacity of the general Gel'fand-Pinsker channel with general state distribution using the information spectrum method. As a by-product, we obtained a lower bound for the second-order coding rate (an achievability result). It would be desirable to derive a corresponding upper bound (a converse result) to the second-order coding rate. Tyagi and Narayan's strong converse technique [3] may be useful in this endeavor. But tightening the image-size characterization technique [4, Theorem 15.10] to the finite blocklength regime is difficult because it involves finding tighter bounds for the difference between the log-cardinalities of an $\eta_1$- and an $\eta_2$-image [4, Lemma 6.6], which requires the blowing-up lemma [4, Lemmas 5.1 and 5.4]. Such a tightening of image-sizes for two channels will automatically lead to a finite blocklength converse for the broadcast channel with degraded message sets [38]. Another option would be to generalize the meta-converse of Polyanskiy-Poor-Verdú [16] to the Gel'fand-Pinsker setting. However, this also appears to be highly non-standard. Finally, although our analysis is general and applies to all channels, it may be interesting to see whether a better identification of the optimal conditional distribution $P_{X^n, U^n | S^n}$ in Theorem 8 (instead of the product of conditionals $P^*_{X,U|S}$) may provide a better lower bound on the second-order coding rate for Costa's "writing on dirty paper" model [39]. This will shed light on how much loss in second-order coding rate there is for dirty paper coding relative to coding for the AWGN channel without state.

Fig. 8. Gaussian approximation (GA) and $(n, \varepsilon)$-capacity (FB) as functions of the average error probability $\varepsilon$ for different blocklengths $n$.



Appendix 

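A. Proof of Theorem 1

Basic definitions: Fix $\delta_1, \delta_2 \in (0, 1)$, $\gamma_1, \gamma_2 > 0$ and some conditional distribution $P_{X^n, U^n | S^n}$. Define the sets

$$\mathcal{T}_1 := \left\{ (u^n, y^n) \in \mathcal{U}^n \times \mathcal{Y}^n : \frac{1}{n} \log \frac{P_{Y^n|U^n}(y^n | u^n)}{P_{Y^n}(y^n)} \ge \underline{I}(\mathbf{U}; \mathbf{Y}) - (1 - \delta_1)\gamma_1 \right\} \qquad (45)$$

$$\mathcal{T}_2 := \left\{ (u^n, s^n) \in \mathcal{U}^n \times \mathcal{S}^n : \frac{1}{n} \log \frac{P_{U^n|S^n}(u^n | s^n)}{P_{U^n}(u^n)} \le \overline{I}(\mathbf{U}; \mathbf{S}) + (1 - \delta_2)\gamma_2 \right\}, \qquad (46)$$

where the random variables $(S^n, X^n, U^n, Y^n) \sim P_{S^n} \circ P_{X^n, U^n | S^n} \circ W^n$. We define the probabilities

$$\pi_1 := \mathbb{P}((U^n, Y^n) \notin \mathcal{T}_1) \qquad (47)$$
$$\pi_2 := \mathbb{P}((U^n, S^n) \notin \mathcal{T}_2) \qquad (48)$$

where $(U^n, Y^n) \sim P_{U^n, Y^n}$ and $(U^n, S^n) \sim P_{U^n, S^n}$, and where these joint distributions are computed with respect to $P_{S^n} \circ P_{X^n, U^n | S^n} \circ W^n$. Note that $\pi_1$ and $\pi_2$ are information spectrum [7] quantities.

Proof: We begin with achievability. We show that the rate

$$R := \underline{I}(\mathbf{U}; \mathbf{Y}) - \gamma_1 - (\overline{I}(\mathbf{U}; \mathbf{S}) + \gamma_2) \qquad (49)$$

is achievable. The next lemma provides an upper bound on the error probability in terms of the above quantities.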



Lemma 9 (Nonasymptotic upper bound on error probability for Gel'fand-Pinsker). Fix a sequence of conditional distributions $\{P_{X^n, U^n | S^n}\}_{n=1}^{\infty}$. This specifies $\underline{I}(\mathbf{U}; \mathbf{Y})$ and $\overline{I}(\mathbf{U}; \mathbf{S})$. For every positive integer $n$, there exists an $(n, \exp(nR), \rho_n)$ code for the general Gel'fand-Pinsker channel where $R$ is defined in (49) and

$$\rho_n := 2\pi_1^{1/2} + \pi_2 + \exp(-\exp(n \delta_2 \gamma_2)) + \exp(-n \delta_1 \gamma_1). \qquad (50)$$

The proof of Lemma 9 is provided in Appendix B. Now, for any fixed $\gamma_1, \gamma_2 > 0$ and $\delta_1, \delta_2 \in (0, 1)$, the last two terms in (50) tend to zero. By the definition of the spectral inf-mutual information rate,

$$\pi_1 = \mathbb{P}\left( \frac{1}{n} \log \frac{P_{Y^n|U^n}(Y^n | U^n)}{P_{Y^n}(Y^n)} < \underline{I}(\mathbf{U}; \mathbf{Y}) - (1 - \delta_1)\gamma_1 \right) \qquad (51)$$

goes to zero. By the definition of the spectral sup-mutual information rate,

$$\pi_2 = \mathbb{P}\left( \frac{1}{n} \log \frac{P_{U^n|S^n}(U^n | S^n)}{P_{U^n}(U^n)} > \overline{I}(\mathbf{U}; \mathbf{S}) + (1 - \delta_2)\gamma_2 \right) \qquad (52)$$

also goes to zero. Hence, in view of (50), the error probability vanishes with increasing blocklength. This proves that the rate $R$ in (49) is achievable. Taking $\gamma_1, \gamma_2 \to 0$ and maximizing over all chains $\mathbf{U} - (\mathbf{X}, \mathbf{S}) - \mathbf{Y}$ proves the direct part of Theorem 1.
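The bound (50) is straightforward to evaluate once $\pi_1$ and $\pi_2$ are known or bounded. The sketch below is illustrative only (the function name and the overflow guard are assumptions). It also shows why the second-order analysis splits the error budget as $\pi_1 = \lambda^2 \varepsilon^2$ and $\pi_2 = (1 - 2\lambda)\varepsilon$: the first two terms of (50) then sum to exactly $\varepsilon$.

```python
from math import exp, sqrt


def lemma9_error_bound(pi1, pi2, n, delta1, gamma1, delta2, gamma2):
    """Evaluate rho_n = 2 sqrt(pi1) + pi2 + exp(-exp(n d2 g2)) + exp(-n d1 g1)."""
    inner = n * delta2 * gamma2
    # exp(inner) overflows a float for large arguments; the doubly
    # exponential term is numerically zero long before that happens.
    double_exp = 0.0 if inner > 700.0 else exp(-exp(inner))
    return 2.0 * sqrt(pi1) + pi2 + double_exp + exp(-n * delta1 * gamma1)
```

For example, with $\lambda = 0.4$ and $\varepsilon = 0.1$ the call `lemma9_error_bound((0.4 * 0.1) ** 2, 0.2 * 0.1, 1000, 0.5, 0.1, 0.5, 0.1)` is dominated by the first two terms.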

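For the converse, we follow the strategy for proving the converse for the general wiretap channel as done by Bloch and Laneman [20, Lemma 4]. Consider a sequence of $(n, M_n, \varepsilon_n)$ codes (Definition 3) achieving a rate $R_n = \frac{1}{n} \log M_n$. Let $U^n \in \mathcal{U}^n$ denote an arbitrary random variable representing the uniform choice of a message in $[1 : M_n]$. Because the message is independent of the state (cf. discussion after Definition 3), this induces the joint distribution $P_{S^n} \circ P_{U^n} \circ P_{X^n | U^n, S^n} \circ W^n$, where $P_{X^n | U^n, S^n}$ models possible stochastic encoding. Clearly, by the independence,

$$\overline{I}(\mathbf{U}; \mathbf{S}) = 0. \qquad (53)$$

Let the set of processes $(\mathbf{S}, \mathbf{U}, \mathbf{X}, \mathbf{Y})$ in which each collection of random variables $(S^n, U^n, X^n, Y^n)$ is distributed as $P_{S^n} \circ P_{U^n} \circ P_{X^n | U^n, S^n} \circ W^n$ (resp. $P_{S^n} \circ P_{U^n | S^n} \circ P_{X^n | U^n, S^n} \circ W^n$) be $\mathcal{I}_{W,S}$ for "independent" (resp. $\mathcal{D}_{W,S}$ for "dependent"). Fix $\gamma > 0$. The Verdú-Han converse theorem [7, Lemma 3.2.2] states that for any $(n, M_n, \varepsilon_n)$ code for the general virtual channel $P_{Y^n | U^n}(y^n | u^n) := \sum_{x^n, s^n} W^n(y^n | x^n, s^n) P_{X^n | U^n, S^n}(x^n | u^n, s^n) P_{S^n}(s^n)$,

$$\varepsilon_n \ge \mathbb{P}\left( \frac{1}{n} \log \frac{P_{Y^n|U^n}(Y^n | U^n)}{P_{Y^n}(Y^n)} \le \frac{1}{n} \log M_n - \gamma \right) - \exp(-n\gamma), \qquad (54)$$

where $U^n$ is uniform over the message set $[1 : M_n]$. Suppose now that $M_n = \lceil \exp[n(\underline{I}(\mathbf{U}; \mathbf{Y}) + 2\gamma)] \rceil$. Then,

$$\varepsilon_n \ge \mathbb{P}\left( \frac{1}{n} \log \frac{P_{Y^n|U^n}(Y^n | U^n)}{P_{Y^n}(Y^n)} \le \underline{I}(\mathbf{U}; \mathbf{Y}) + \gamma \right) - \exp(-n\gamma). \qquad (55)$$

By the definition of the spectral inf-mutual information rate, the first term on the right-hand side of (55) converges to 1. Since $\exp(-n\gamma) \to 0$, $\varepsilon_n \to 1$ if $M_n = \lceil \exp[n(\underline{I}(\mathbf{U}; \mathbf{Y}) + 2\gamma)] \rceil$. This means that a necessary condition for the code to have vanishing error probability is for $R = \lim_{n \to \infty} R_n$ to satisfy

$$R \le \underline{I}(\mathbf{U}; \mathbf{Y}) + 2\gamma \qquad (56)$$
$$= \underline{I}(\mathbf{U}; \mathbf{Y}) - \overline{I}(\mathbf{U}; \mathbf{S}) + 2\gamma \qquad (57)$$
$$\le \sup_{(\mathbf{S}, \mathbf{U}, \mathbf{X}, \mathbf{Y}) \in \mathcal{I}_{W,S}} \left\{ \underline{I}(\mathbf{U}; \mathbf{Y}) - \overline{I}(\mathbf{U}; \mathbf{S}) \right\} + 2\gamma \qquad (58)$$
$$\le \sup_{(\mathbf{S}, \mathbf{U}, \mathbf{X}, \mathbf{Y}) \in \mathcal{D}_{W,S}} \left\{ \underline{I}(\mathbf{U}; \mathbf{Y}) - \overline{I}(\mathbf{U}; \mathbf{S}) \right\} + 2\gamma, \qquad (59)$$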

where (57) follows from (53) and (59) follows because $\mathcal{I}_{W,S} \subset \mathcal{D}_{W,S}$, i.e., the set of dependent processes includes the independent processes as a special case. Since $\gamma > 0$ is arbitrary, we have proven the upper bound in Theorem 1 and this completes the proof of Theorem 1. ■

B. Proof of Lemma 9

Proof: Refer to the definitions of the sets $\mathcal{T}_1$ and $\mathcal{T}_2$ and the probabilities $\pi_1$ and $\pi_2$ in Appendix A, in which the random variables $(S^n, X^n, U^n, Y^n) \sim P_{S^n} \circ P_{X^n, U^n | S^n} \circ W^n$. Define the mapping $\eta : \mathcal{U}^n \times \mathcal{S}^n \to \mathbb{R}_+$ as

$$\eta(u^n, s^n) := \sum_{x^n,\, y^n : (u^n, y^n) \notin \mathcal{T}_1} W^n(y^n | x^n, s^n) P_{X^n | U^n, S^n}(x^n | u^n, s^n), \qquad (60)$$

where $P_{X^n | U^n, S^n}$ is the conditional induced by $P_{X^n, U^n | S^n}$. Analogous to [9, Lemma 4.3], define the set

$$\mathcal{A} := \left\{ (u^n, s^n) \in \mathcal{U}^n \times \mathcal{S}^n : \eta(u^n, s^n) \le \pi_1^{1/2} \right\}. \qquad (61)$$

Note the differences between the definition of $\eta$ vis-a-vis that in [9, Lemma 4.3]. In particular, the summand in the definition of $\eta$ in (60) depends on $P_{X^n | U^n, S^n}$ and the channel $W^n$. Now, for brevity, define the "inflated rate"

$$\tilde{R} := \underline{I}(\mathbf{U}; \mathbf{Y}) - \gamma_1 \qquad (62)$$

so the rate of each subcodebook is $\tilde{R} - R = \overline{I}(\mathbf{U}; \mathbf{S}) + \gamma_2$.

Random code generation: Randomly and independently generate $\lceil \exp(n\tilde{R}) \rceil$ codewords $\{u^n(l) : l \in [1 : \exp(n\tilde{R})]\}$, each drawn from $P_{U^n}$. Denote the set of random codewords and a specific realization as $\mathcal{C}$ and $c$ respectively. Deterministically partition the codewords in $\mathcal{C}$ into $\lceil \exp(nR) \rceil$ subcodebooks $\mathcal{C}(m) = \{u^n(l) : l \in [(m - 1)\exp(n(\tilde{R} - R)) + 1 : m \exp(n(\tilde{R} - R))]\}$ where $m \in [1 : \exp(nR)]$. Note that each subcodebook contains $\lceil \exp(n(\overline{I}(\mathbf{U}; \mathbf{S}) + \gamma_2)) \rceil$ codewords. Here is where our coding scheme differs from the general Wyner-Ziv problem [8] and the general WAK problem [18], [32]. We randomly generate exponentially many subcodebooks instead of asserting the existence of one by random selection via Wyner's PBL [9, Lemma 4.3]. By retaining the randomness in the $U^n$ codewords, it is easier to bound the probability of the decoding error $\mathcal{E}_3$ defined in (67).
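The deterministic partition into subcodebooks amounts to simple index arithmetic. The sketch below is a minimal illustration, with rates measured in nats to match $\exp(\cdot)$; the function name is an assumption and the subcodebook size is rounded up to an integer.

```python
from math import ceil, exp


def subcodebook_indices(m, n, rate, inflated_rate):
    """Codeword indices l belonging to subcodebook C(m), for m = 1, 2, ...

    rate:          message rate R (nats per channel use)
    inflated_rate: total codebook rate, so each subcodebook has
                   ceil(exp(n * (inflated_rate - rate))) codewords
    """
    size = ceil(exp(n * (inflated_rate - rate)))
    return range((m - 1) * size + 1, m * size + 1)
```

Consecutive messages therefore own consecutive, disjoint blocks of codeword indices, and the encoder searches only inside the block of the message to be sent.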

Encoding: The encoder, given $m \in [1 : \exp(nR)]$ and the state sequence $s^n \in \mathcal{S}^n$ (noncausally), finds the sequence $u^n(l) \in \mathcal{C}(m)$ with the smallest index $l$ satisfying

$$(u^n(l), s^n) \in \mathcal{A}. \qquad (63)$$

If no such $l$ exists, set $l = 1$. Randomly generate a sequence $x^n \sim P_{X^n | U^n, S^n}(\,\cdot\, | u^n(l), s^n)$ and transmit it as the channel input in addition to $s^n$. Note that the rate of the code is given by $R$ in (49) since there are $\lceil \exp(nR) \rceil$ subcodebooks, each representing one message.

Decoding: Given $y^n \in \mathcal{Y}^n$, the decoder declares that $\hat{m} \in [1 : \exp(nR)]$ is the message sent if it is the unique message such that

$$(u^n(l), y^n) \in \mathcal{T}_1 \qquad (64)$$

for some $u^n(l) \in \mathcal{C}(\hat{m})$. If there is no such unique $\hat{m}$, declare an error.

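Analysis of Error Probability: Assume that $m = 1$ and let $L$ denote the random index chosen by the encoder. Note that $L = L(S^n)$ is a random function of the random state sequence $S^n$. We will denote the chosen codeword interchangeably by $U^n(L)$ or $F_{\mathcal{C}}(S^n) \in \mathcal{U}^n$. The latter notation makes it clear that the chosen codeword is a random function of the state. The randomness of $F_{\mathcal{C}}(S^n)$ comes not only from the random state $S^n$ but also from the random codewords in $\mathcal{C}$. There are three sources of error:

$$\mathcal{E}_1 := \{ \forall\, U^n(l) \in \mathcal{C}(1) : (U^n(l), S^n) \notin \mathcal{A} \} \qquad (65)$$
$$\mathcal{E}_2 := \{ (U^n(L), Y^n) \notin \mathcal{T}_1 \} \qquad (66)$$
$$\mathcal{E}_3 := \{ \exists\, U^n(l) \notin \mathcal{C}(1) : (U^n(l), Y^n) \in \mathcal{T}_1 \} \qquad (67)$$

In terms of $\mathcal{E}_1$, $\mathcal{E}_2$ and $\mathcal{E}_3$, the probability of error $\mathbb{P}(\mathcal{E})$ of the code can be bounded as

$$\mathbb{P}(\mathcal{E}) \le \mathbb{P}(\mathcal{E}_1) + \mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c) + \mathbb{P}(\mathcal{E}_3). \qquad (68)$$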

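We bound each of the probabilities above. First consider $\mathbb{P}(\mathcal{E}_1)$:

$$\mathbb{P}(\mathcal{E}_1) = \mathbb{P}(\forall\, U^n(l) \in \mathcal{C}(1) : (U^n(l), S^n) \notin \mathcal{A}) \qquad (69)$$
$$= \sum_{s^n} P_{S^n}(s^n) \left[ \sum_{u^n : (u^n, s^n) \notin \mathcal{A}} P_{U^n}(u^n) \right]^{|\mathcal{C}(1)|}, \qquad (70)$$

where (70) holds because the codewords $U^n(l)$ are generated independently of each other and they are independent of the state sequence $S^n$. We now upper bound (70) as follows:

$$\mathbb{P}(\mathcal{E}_1) \le \sum_{s^n} P_{S^n}(s^n) \left[ 1 - \sum_{u^n} P_{U^n}(u^n) \chi(u^n, s^n) \right]^{|\mathcal{C}(1)|}, \qquad (71)$$

where $\chi(u^n, s^n)$ is the indicator of the set $\mathcal{A} \cap \mathcal{T}_2$, i.e.,

$$\chi(u^n, s^n) := 1\{ (u^n, s^n) \in \mathcal{A} \cap \mathcal{T}_2 \}. \qquad (72)$$

Clearly, by using the definition of $\mathcal{T}_2$ in (46), we have that for all $(u^n, s^n) \in \mathcal{A} \cap \mathcal{T}_2$,

$$P_{U^n}(u^n) \ge P_{U^n | S^n}(u^n | s^n) \exp(-n(\overline{I}(\mathbf{U}; \mathbf{S}) + (1 - \delta_2)\gamma_2)). \qquad (73)$$

Thus, substituting the bound in (73) into (71), we have

$$\mathbb{P}(\mathcal{E}_1) \le \sum_{s^n} P_{S^n}(s^n) \left[ 1 - \exp(-n(\overline{I}(\mathbf{U}; \mathbf{S}) + (1 - \delta_2)\gamma_2)) \sum_{u^n} P_{U^n | S^n}(u^n | s^n) \chi(u^n, s^n) \right]^{|\mathcal{C}(1)|} \qquad (74)$$
$$\le \sum_{s^n} P_{S^n}(s^n) \left[ 1 - \sum_{u^n} P_{U^n | S^n}(u^n | s^n) \chi(u^n, s^n) + \exp\left[ -\exp(-n(\overline{I}(\mathbf{U}; \mathbf{S}) + (1 - \delta_2)\gamma_2)) |\mathcal{C}(1)| \right] \right], \qquad (75)$$

where (75) comes from the inequality $(1 - xy)^k \le 1 - x + \exp(-yk)$. Recall that the size of the subcodebook $\mathcal{C}(1)$ is $|\mathcal{C}(1)| = \lceil \exp(n(\overline{I}(\mathbf{U}; \mathbf{S}) + \gamma_2)) \rceil$, so the exponent in the last term of (75) is at most $-\exp(n \delta_2 \gamma_2)$. Thus,

$$\mathbb{P}(\mathcal{E}_1) \le 1 - \sum_{u^n, s^n} P_{U^n, S^n}(u^n, s^n) \chi(u^n, s^n) + \exp(-\exp(n \delta_2 \gamma_2)) \qquad (76)$$
$$= \mathbb{P}((U^n, S^n) \in \mathcal{A}^c \cup \mathcal{T}_2^c) + \exp(-\exp(n \delta_2 \gamma_2)) \qquad (77)$$
$$\le \mathbb{P}((U^n, S^n) \in \mathcal{A}^c) + \mathbb{P}((U^n, S^n) \in \mathcal{T}_2^c) + \exp(-\exp(n \delta_2 \gamma_2)) \qquad (78)$$
$$= \mathbb{P}((U^n, S^n) \in \mathcal{A}^c) + \pi_2 + \exp(-\exp(n \delta_2 \gamma_2)), \qquad (79)$$

where (77) follows from the definition of $\chi(u^n, s^n)$ in (72), and (79) follows from the definition of $\pi_2$ in (48). We now bound the first term in (79). We have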



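$$\mathbb{P}\big((U^n, S^n) \in \mathcal{A}^c\big) = \mathbb{P}\big(\eta(U^n, S^n) > \pi_1^{1/2}\big) \tag{80}$$

$$\le \pi_1^{-1/2}\, \mathbb{E}\big(\eta(U^n, S^n)\big) \tag{81}$$

$$= \pi_1^{-1/2} \sum_{u^n, s^n} P_{U^n,S^n}(u^n, s^n)\, \eta(u^n, s^n) \tag{82}$$

$$= \pi_1^{-1/2} \sum_{u^n, s^n} P_{U^n,S^n}(u^n, s^n) \sum_{x^n, y^n : (u^n, y^n) \notin \mathcal{T}_1} W^n(y^n|x^n, s^n) P_{X^n|U^n,S^n}(x^n|u^n, s^n) \tag{83}$$

$$= \pi_1^{-1/2} \sum_{u^n, s^n, x^n, y^n : (u^n, y^n) \notin \mathcal{T}_1} P_{U^n,S^n,X^n,Y^n}(u^n, s^n, x^n, y^n) \tag{84}$$

$$= \pi_1^{-1/2}\, \mathbb{P}\big((U^n, Y^n) \notin \mathcal{T}_1\big) \tag{85}$$

$$= \pi_1^{1/2}, \tag{86}$$

where (81) is by Markov's inequality and (83) is due to the definition of $\eta(u^n, s^n)$ in (60). Equality (84) follows by the Markov chain $U^n - (X^n, S^n) - Y^n$ and (86) is by the definition of $\pi_1$ in (47). Combining the bounds in (79) and (86) yields

$$\mathbb{P}(\mathcal{E}_1) \le \pi_1^{1/2} + \pi_2 + \exp(-\exp(n\delta_2\gamma_2)). \tag{87}$$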
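The elementary inequality $(1-xy)^k \le 1 - x + e^{-yk}$ behind (75) can be sanity-checked numerically. The following is a minimal sketch in Python and is not part of the original proof; the grid, the helper names and the sample values of $n$, $\delta_2$ and $\gamma_2$ are illustrative assumptions. It also illustrates why the last term of (76) is doubly exponential: with $y = \exp(-n(\overline{I} + (1-\delta_2)\gamma_2))$ and $k = \exp(n(\overline{I} + \gamma_2))$ (ignoring the ceiling), the product $yk$ equals $\exp(n\delta_2\gamma_2)$.

```python
import math

def lhs(x, y, k):
    """Left-hand side (1 - x*y)^k of the covering inequality."""
    return (1.0 - x * y) ** k

def rhs(x, y, k):
    """Right-hand side 1 - x + exp(-y*k)."""
    return 1.0 - x + math.exp(-y * k)

def max_violation(grid=21, ks=(1, 2, 5, 10, 100, 1000)):
    """Largest value of lhs - rhs over a grid of x, y in [0, 1]; should be <= 0."""
    worst = -math.inf
    for i in range(grid):
        for j in range(grid):
            x, y = i / (grid - 1), j / (grid - 1)
            for k in ks:
                worst = max(worst, lhs(x, y, k) - rhs(x, y, k))
    return worst

def covering_tail(n, delta2, gamma2):
    """exp(-exp(n*delta2*gamma2)), the last term of (76); inputs are hypothetical."""
    return math.exp(-math.exp(n * delta2 * gamma2))

print(max_violation())
print(covering_tail(n=1000, delta2=0.1, gamma2=0.05))
```

A grid check is of course no substitute for the analytic argument; it only confirms that the inequality is used in the right direction.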






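Now we bound $\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c)$. Recall that the mapping from the noncausal state sequence $S^n$ to the chosen codeword $U^n(L)$ is denoted as $F_{\mathcal{C}}(S^n)$. Define the event

$$\mathcal{F} := \{(F_{\mathcal{C}}(S^n), S^n) \in \mathcal{A}\}. \tag{88}$$

Then, $\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c)$ can be bounded as

$$\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c) = \mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c \cap \mathcal{F}) + \mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c \cap \mathcal{F}^c) \tag{89}$$

$$\le \mathbb{P}(\mathcal{E}_2 \cap \mathcal{F}) + \mathbb{P}(\mathcal{E}_1^c \cap \mathcal{F}^c). \tag{90}$$

The second term in (90) is identically zero because given that $\mathcal{E}_1^c$ occurs, we can successfully find a $u^n \in \mathcal{C}(1)$ such that $(u^n, s^n) \in \mathcal{A}$, hence $\mathcal{E}_1^c \cap \mathcal{F}^c = \emptyset$. Refer to the encoding step in (63). So we consider the first term in (90). Let $\mathbb{E}_{\mathcal{C}}[\cdot]$ be the expectation over the random codebook $\mathcal{C}$, i.e., $\mathbb{E}_{\mathcal{C}}[\phi(\mathcal{C})] = \sum_{c} \mathbb{P}(\mathcal{C} = c)\phi(c)$, where $c$ runs over all possible sets of sequences $\{u^n(l) \in \mathcal{U}^n : l \in [1 : \exp(n\tilde{R})]\}$. Consider,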

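$$\mathbb{P}\big(\{(F_{\mathcal{C}}(S^n), Y^n) \notin \mathcal{T}_1\} \cap \{(F_{\mathcal{C}}(S^n), S^n) \in \mathcal{A}\}\big)$$

$$= \mathbb{E}_{\mathcal{C}} \Bigg[ \sum_{\substack{(s^n, y^n) : (F_{\mathcal{C}}(s^n), y^n) \notin \mathcal{T}_1 \\ (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}}} P_{S^n}(s^n) \sum_{x^n} W^n(y^n|x^n, s^n) P_{X^n|U^n,S^n}(x^n|F_{\mathcal{C}}(s^n), s^n) \Bigg] \tag{91}$$

$$= \mathbb{E}_{\mathcal{C}} \Bigg[ \sum_{s^n : (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}} P_{S^n}(s^n) \sum_{x^n, y^n : (F_{\mathcal{C}}(s^n), y^n) \notin \mathcal{T}_1} W^n(y^n|x^n, s^n) P_{X^n|U^n,S^n}(x^n|F_{\mathcal{C}}(s^n), s^n) \Bigg]. \tag{92}$$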

Equality (91) can be explained as follows: Conditioned on $\{\mathcal{C} = c\}$ for some (deterministic) codebook $c$, the mapping $F_c : \mathcal{S}^n \to \mathcal{U}^n$ is deterministic (cf. the encoding step in (63)) and thus $\{U^n = F_c(s^n)\}$ holds if $\{S^n = s^n\}$ holds. Therefore, the conditional distribution of $Y^n$ given $\{S^n = s^n\}$, and hence also $\{U^n = F_c(s^n)\}$, is $\sum_{x^n} W^n(\,\cdot\,|x^n, s^n) P_{X^n|U^n,S^n}(x^n|F_c(s^n), s^n)$ for fixed $c$. The step (91) differs subtly from the proofs of the general Wyner-Ziv problem [8] and the general lossless source coding with coded side information problem [18], [32]. In [8], [18] and [32], the equivalent of $Y^n$ depends only implicitly on the auxiliary $U^n$ through another variable, say$^7$ $X^n$ (i.e., $Y^n - X^n - U^n$ forms a Markov chain). In the Gel'fand-Pinsker problem, $Y^n$ depends on $X^n$ and $S^n$, the former being a (stochastic) function of both $U^n$ and $S^n$. Thus, given $\{S^n = s^n\}$, $Y^n$ also depends on the state $S^n$ and the auxiliary $U^n$ through the codebook $\mathcal{C}$ and the covering procedure specified by $F_{\mathcal{C}}$. Now using the definitions of $\eta(u^n, s^n)$ and $\mathcal{A}$ in (60) and (61) respectively, we can bound (92) as follows:

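$$\mathbb{P}\big(\{(F_{\mathcal{C}}(S^n), Y^n) \notin \mathcal{T}_1\} \cap \{(F_{\mathcal{C}}(S^n), S^n) \in \mathcal{A}\}\big)$$

$$= \mathbb{E}_{\mathcal{C}} \Bigg[ \sum_{s^n : (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}} P_{S^n}(s^n)\, \eta(F_{\mathcal{C}}(s^n), s^n) \Bigg] \tag{93}$$

$$\le \mathbb{E}_{\mathcal{C}} \Bigg[ \sum_{s^n : (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}} P_{S^n}(s^n)\, \pi_1^{1/2} \Bigg] \tag{94}$$

$$\le \pi_1^{1/2}. \tag{95}$$

Uniting (90) and (95) yields

$$\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c) \le \pi_1^{1/2}. \tag{96}$$

Finally, we consider the probability $\mathbb{P}(\mathcal{E}_3)$:

$$\mathbb{P}(\mathcal{E}_3) = \mathbb{P}\big((U^n(\tilde{l}), Y^n) \in \mathcal{T}_1 \text{ for some } U^n(\tilde{l}) \notin \mathcal{C}(1)\big) \tag{97}$$

$$\le \sum_{\tilde{l} = \lceil \exp(n(\tilde{R} - R)) \rceil + 1}^{\lceil \exp(n\tilde{R}) \rceil} \mathbb{P}\big((U^n(\tilde{l}), Y^n) \in \mathcal{T}_1\big), \tag{98}$$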


7 In the Wyner-Ziv problem [8], $X^n$ is the source to be reconstructed to within some distortion with the help of $Y^n$ and in the WAK problem [18], [32], $Y^n$ is the source to be almost-losslessly transmitted with the help of a coded version of $X^n$.

where (98) follows from the union bound and the fact that the indices of the confounding codewords belong to the set $[\exp(n(\tilde{R} - R)) + 1 : \exp(n\tilde{R})]$. Now, we upper bound the probability in (98). Note that if $\tilde{l} \in [\exp(n(\tilde{R} - R)) + 1 : \exp(n\tilde{R})]$, then $U^n(\tilde{l})$ is independent of $Y^n$. Thus,

$$\mathbb{P}\big((U^n(\tilde{l}), Y^n) \in \mathcal{T}_1\big) = \sum_{(u^n, y^n) \in \mathcal{T}_1} P_{U^n}(u^n) P_{Y^n}(y^n) \tag{99}$$

$$\le \sum_{(u^n, y^n) \in \mathcal{T}_1} P_{U^n}(u^n) P_{Y^n|U^n}(y^n|u^n) \exp\big(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - (1-\delta_1)\gamma_1)\big) \tag{100}$$

$$\le \exp\big(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - (1-\delta_1)\gamma_1)\big), \tag{101}$$

where (100) follows from the definition of $\mathcal{T}_1$ in (45). Now, substituting (101) into (98) yields

$$\mathbb{P}(\mathcal{E}_3) \le \exp(n\tilde{R}) \exp\big(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - (1-\delta_1)\gamma_1)\big) \tag{102}$$

$$= \exp\big(n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma_1)\big) \exp\big(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - (1-\delta_1)\gamma_1)\big) \tag{103}$$

$$= \exp(-n\delta_1\gamma_1), \tag{104}$$

where (103) follows from the definition of $\tilde{R}$ in (62). Uniting (68), (87), (96) and (104) shows that $\mathbb{P}(\mathcal{E}) \le \rho_n$, where $\rho_n$ is defined in (50). By the selection lemma, there exists a (deterministic) code whose average error probability is no larger than $\rho_n$. ■
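To make the final bound concrete, the sum of (87), (96) and (104) inside (68) gives $\rho_n = 2\pi_1^{1/2} + \pi_2 + \exp(-\exp(n\delta_2\gamma_2)) + \exp(-n\delta_1\gamma_1)$. The sketch below evaluates this expression in Python. The function name and all numerical inputs are hypothetical and only illustrate how the four terms trade off; in the proof, $\pi_1$ and $\pi_2$ are themselves determined by $\gamma_1$, $\gamma_2$, $\delta_1$, $\delta_2$ and the source and channel statistics.

```python
import math

def rho_n(pi1, pi2, n, delta1, gamma1, delta2, gamma2):
    """Evaluate 2*sqrt(pi1) + pi2 + exp(-exp(n*d2*g2)) + exp(-n*d1*g1)."""
    inner = n * delta2 * gamma2
    # Guard against overflow of exp(inner); the covering term is then numerically 0.
    covering_tail = 0.0 if inner > 700.0 else math.exp(-math.exp(inner))
    packing_tail = math.exp(-n * delta1 * gamma1)
    return 2.0 * math.sqrt(pi1) + pi2 + covering_tail + packing_tail

# Hypothetical numbers: pi1 = 1e-4, pi2 = 1e-2, n = 1000.
print(rho_n(1e-4, 1e-2, 1000, 0.1, 0.05, 0.1, 0.05))
```

The square root on $\pi_1$ (from the Markov-inequality step (81)) is why the packing probability has to be driven to roughly the square of its share of the error budget.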



C. Proof of Corollary 2

Proof: In this problem where the state information is available at the encoder and the decoder, we can regard 
the output Y of the Gel'fand-Pinsker problem as (Y, S). For achievability, we lower bound the generalization of 
the Cover-Chiang ||33j result in (O with S d = S c = S as follows: 



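$$C_{\mathrm{CC}} = \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{U};\mathbf{Y},\mathbf{S}) - \overline{I}(\mathbf{U};\mathbf{S}) \tag{105}$$

$$\ge \underline{I}(\mathbf{X}^*;\mathbf{Y}^*,\mathbf{S}) - \overline{I}(\mathbf{X}^*;\mathbf{S}) \tag{106}$$

$$= \underline{I}(\mathbf{X}^*;\mathbf{Y}^*,\mathbf{S}) - \underline{I}(\mathbf{X}^*;\mathbf{S}) \tag{107}$$

$$\ge \underline{I}(\mathbf{X}^*;\mathbf{Y}^*|\mathbf{S}), \tag{108}$$

where (106) follows because the choice $(\{P^*_{X^n|S^n}\}_{n=1}^{\infty}, U^n = X^n)$ belongs to the constraint set in (105), (107) uses the assumption in (20) and (108) follows from the basic information spectrum inequality $\text{p-}\liminf_{n\to\infty}(A_n + B_n) \le \text{p-}\liminf_{n\to\infty} A_n + \text{p-}\limsup_{n\to\infty} B_n$ [7, pp. 15]. This shows that $C_{\mathrm{ED}} = \underline{I}(\mathbf{X}^*;\mathbf{Y}^*|\mathbf{S})$ is an achievable rate.
For the converse, we upper bound (105) as follows: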

$$C_{\mathrm{CC}} \le \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{S}) \tag{109}$$

$$\le \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{X};\mathbf{Y}|\mathbf{S}) \tag{110}$$

$$= \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}|\mathbf{S}), \tag{111}$$

where (109) follows from the superadditivity of p-lim inf, (110) follows from the p-lim inf version of the (conditional) data processing inequality [6, Theorem 9] and (111) follows because there is no $\mathbf{U}$ in the objective function in (110), so taking $U^n = \emptyset$ does not violate optimality in (110). Since (111) implies that all achievable rates are bounded above by $C_{\mathrm{ED}}$, this proves the converse. ■

D. Proof of Corollary 3

Proof: For achievability, note that since the channel and state are memoryless, we lower bound (17) by replacing the constraint set with the set of conditional distributions of the product form, i.e., $P_{X^n,U^n|S^n} = \prod_{i=1}^n P_{X_i,U_i|S_i}$ for some $\{P_{X_i,U_i|S_i} : \mathcal{S} \to \mathcal{X} \times \mathcal{U}\}_{i=1}^n$. Let $\mathcal{M}_{W,S}$ (which stands for "memoryless") be the set of all $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y})$ such that $P_{X^n,U^n|S^n} = \prod_{i=1}^n P_{X_i,U_i|S_i}$ for some $\{P_{X_i,U_i|S_i}\}_{i=1}^n$, the channels coincide with $\{\prod_{i=1}^n W_i\}_{n=1}^{\infty}$ and the states coincide with $\{\prod_{i=1}^n P_{S_i}\}_{n=1}^{\infty}$. Let $(S_i, U_i^*, X_i^*, Y_i^*)$ be distributed according to $P_{S_i} \circ P^*_{X_i,U_i|S_i} \circ W_i$, the optimal distribution in (22). Consider,



$$C \ge \sup_{(\mathbf{U},\mathbf{X},\mathbf{S},\mathbf{Y}) \in \mathcal{M}_{W,S}} \big\{ \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}) \big\} \tag{112}$$

$$\ge \text{p-}\liminf_{n\to\infty} \frac{1}{n} \log \frac{P_{(Y^n)^*|(U^n)^*}((Y^n)^*|(U^n)^*)}{P_{(Y^n)^*}((Y^n)^*)} - \text{p-}\limsup_{n\to\infty} \frac{1}{n} \log \frac{P_{(U^n)^*|S^n}((U^n)^*|S^n)}{P_{(U^n)^*}((U^n)^*)} \tag{113}$$

$$\ge \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[ \log \frac{P_{Y_i^*|U_i^*}(Y_i^*|U_i^*)}{P_{Y_i^*}(Y_i^*)} \right] - \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[ \log \frac{P_{U_i^*|S_i}(U_i^*|S_i)}{P_{U_i^*}(U_i^*)} \right] \tag{114}$$

$$= \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I(U_i^*;Y_i^*) - \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I(U_i^*;S_i), \tag{115}$$

where (113) follows by substituting the optimal $(S_i, U_i^*, X_i^*, Y_i^*)$ into (112). Inequality (114) follows from Chebyshev's inequality where we used the memoryless assumption and the assumption that the alphabets are finite so the variances of the information densities are uniformly bounded [7, Equation (3.2.15)]. Essentially the limit inferior (resp. superior) in probability of the normalized information densities becomes the regular limit inferior (resp. superior) of averages of mutual informations under the memoryless assumption. Now, we assume the first limit in (23) exists. Then, the lim sup in (115) is in fact a limit and we have



$$C \ge \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I(U_i^*;Y_i^*) - \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I(U_i^*;S_i) \tag{116}$$

$$= \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \big[ I(U_i^*;Y_i^*) - I(U_i^*;S_i) \big], \tag{117}$$

where (117) follows from the fact that $\liminf_{n\to\infty}(a_n + b_n) = \liminf_{n\to\infty} a_n + \lim_{n\to\infty} b_n$ if $b_n$ converges. If instead the second limit in (23) exists, the argument from (115) to (117) proceeds along exactly the same lines. The proof of the direct part is completed by invoking the definition of $C(W_i, P_{S_i})$ in (22).
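The analysis fact used in (117) can be illustrated on synthetic sequences. The sketch below is only an illustration with hypothetical sequences (they are not the mutual information averages of the corollary), and it approximates a limit inferior by the minimum over a long tail. It contrasts the case where $b_n$ converges (additivity holds) with a case where neither sequence converges (only superadditivity, as used later in (121), holds).

```python
def tail_inf(seq, start):
    """Crude proxy for the limit inferior: minimum of the sequence from index `start` on."""
    return min(seq[start:])

N = 20000
a = [(-1) ** n for n in range(1, N + 1)]       # oscillates, liminf = -1
b = [2.0 + 1.0 / n for n in range(1, N + 1)]   # converges to 2
s = [x + y for x, y in zip(a, b)]              # liminf = -1 + 2 = 1

c = [-x for x in a]                            # oscillates, liminf = -1
t = [x + y for x, y in zip(a, c)]              # identically 0

print(tail_inf(s, N // 2), tail_inf(a, N // 2) + 2.0)               # approximately equal
print(tail_inf(t, N // 2), tail_inf(a, N // 2) + tail_inf(c, N // 2))  # 0 versus -2
```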

For the converse, we follow the proof of [20, Corollary 1] and only assume that $|\mathcal{S}| < \infty$. Let $(S^n, U^n, X^n, Y^n)$ be dummy variables distributed as $P_{S^n} \circ P_{U^n|S^n} \circ P_{X^n|U^n,S^n} \circ W^n$. Note that $P_{S^n}$ and $W^n$ are assumed to be memoryless. From [7, Theorem 3.5.2],



$$\underline{I}(\mathbf{U};\mathbf{Y}) \le \liminf_{n\to\infty} \frac{1}{n} I(U^n;Y^n) \tag{118}$$

and if $|\mathcal{S}| < \infty$, we also have

$$\overline{I}(\mathbf{U};\mathbf{S}) \ge \limsup_{n\to\infty} \frac{1}{n} I(U^n;S^n). \tag{119}$$

Now we can upper bound the objective function in (17) as follows:

$$\underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}) \le \liminf_{n\to\infty} \frac{1}{n} I(U^n;Y^n) + \liminf_{n\to\infty} \Big( -\frac{1}{n} I(U^n;S^n) \Big) \tag{120}$$

$$\le \liminf_{n\to\infty} \frac{1}{n} \big[ I(U^n;Y^n) - I(U^n;S^n) \big], \tag{121}$$

where (121) follows from the superadditivity of the limit inferior. Hence, it suffices to single-letterize the expression in $[\,\cdot\,]$ in (121). To start, consider



$$I(U^n;Y^n) - I(U^n;S^n) = \sum_{i=1}^n I(U^n;Y_i|Y^{i-1}, S_{i+1}^n) - I(U^n;S_i|Y^{i-1}, S_{i+1}^n) \tag{122}$$

$$= \sum_{i=1}^n I(U^n, Y^{i-1}, S_{i+1}^n; Y_i) - I(Y_i; Y^{i-1}, S_{i+1}^n) - I(U^n, Y^{i-1}, S_{i+1}^n; S_i) + I(S_i; Y^{i-1}, S_{i+1}^n), \tag{123}$$

where (122) follows from the key identity in Csiszár and Körner [4, Lemma 17.12] and (123) follows from the chain rule for mutual information. Now, we relate the (sum of the) second term to the (sum of the) fourth term:



$$\sum_{i=1}^n I(Y_i; Y^{i-1}, S_{i+1}^n) \ge \sum_{i=1}^n I(Y_i; S_{i+1}^n | Y^{i-1}) \tag{124}$$

$$= \sum_{i=1}^n I(S_i; Y^{i-1} | S_{i+1}^n) \tag{125}$$

$$= \sum_{i=1}^n I(S_i; Y^{i-1}, S_{i+1}^n), \tag{126}$$

where (125) follows from the Csiszár sum identity [2, Chapter 2] and (126) follows because $S^n = (S_1, \ldots, S_n)$ is a memoryless process. Putting together (121), (123) and (126), we have the upper bound:

$$C \le \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n I(U^n, Y^{i-1}, S_{i+1}^n; Y_i) - I(U^n, Y^{i-1}, S_{i+1}^n; S_i). \tag{127}$$

Note that (127) holds for all $P_{X^n,U^n|S^n}$ of the product form (i.e., $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y}) \in \mathcal{M}_{W,S}$) because the state and channel are memoryless. Now define $\mathsf{U}_i := (U^n, Y^{i-1}, S_{i+1}^n)$ and $\mathsf{Y}_i := Y_i$, $\mathsf{X}_i := X_i$ and $\mathsf{S}_i := S_i$. Clearly, the Markov chain $\mathsf{U}_i - (\mathsf{X}_i, \mathsf{S}_i) - \mathsf{Y}_i$ is satisfied for all $i \in [1 : n]$. Substituting these identifications into (127) yields

$$C \le \liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \max_{\mathsf{U}_i - (\mathsf{X}_i, \mathsf{S}_i) - \mathsf{Y}_i} I(\mathsf{U}_i; \mathsf{Y}_i) - I(\mathsf{U}_i; \mathsf{S}_i), \tag{128}$$

which upon invoking the definition of $C(W_i, P_{S_i})$ in (22) completes the converse proof of Corollary 3.

We remark that in the single-letterization procedure for the converse, it seems as if we did not use the assumption that the transmitted message is independent of the state. This is not true because in the converse proof of Theorem 1 we did in fact use this key assumption. See (53) and (57) where the message is represented by $\mathbf{U}$. ■
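The Csiszár sum identity invoked in (125) holds for an arbitrary joint distribution, which makes it easy to check numerically. The sketch below is an illustration only: it draws a random joint pmf of three binary state symbols and three binary outputs (a hypothetical toy instance with $n = 3$, unrelated to any particular channel) and compares $\sum_i I(Y_i; S_{i+1}^n | Y^{i-1})$ with $\sum_i I(S_i; Y^{i-1} | S_{i+1}^n)$, in nats.

```python
import itertools
import math
import random

def random_pmf(num_vars, seed=0):
    """Random joint pmf over num_vars binary variables, as a dict outcome -> probability."""
    rng = random.Random(seed)
    outcomes = list(itertools.product((0, 1), repeat=num_vars))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}

def entropy(pmf, idx):
    """Entropy (nats) of the marginal on the coordinates listed in idx."""
    marg = {}
    for o, p in pmf.items():
        key = tuple(o[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in marg.values() if p > 0)

def cond_mi(pmf, a, b, c=()):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

def csiszar_sums(pmf, n):
    """Both sides of the Csiszar sum identity; S_i is coordinate i-1, Y_i is coordinate n+i-1."""
    s = lambda i: i - 1
    y = lambda i: n + i - 1
    lhs = rhs = 0.0
    for i in range(1, n + 1):
        past_y = [y(j) for j in range(1, i)]
        future_s = [s(j) for j in range(i + 1, n + 1)]
        lhs += cond_mi(pmf, [y(i)], future_s, past_y)   # I(Y_i; S_{i+1}^n | Y^{i-1})
        rhs += cond_mi(pmf, [s(i)], past_y, future_s)   # I(S_i; Y^{i-1} | S_{i+1}^n)
    return lhs, rhs

print(csiszar_sums(random_pmf(6, seed=1), 3))
```

The two printed numbers agree up to floating-point error even though the toy state sequence here is not memoryless; memorylessness is only needed for the subsequent step (126).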



E. Verification of (28) in Example 2

Let $\mathcal{E}_n := \mathcal{E} \cap [1 : n]$ and $\mathcal{O}_n := \mathcal{O} \cap [1 : n]$ be, respectively, the sets of even and odd integers up to some $n \in \mathbb{N}$. To verify (28), we use the result in (24) and the definitions of the channel and state in (26) and (27) respectively. We first split the sum into odd and even parts as follows:

Cm'Igss = 7: lim inf 

Z n— >oo 

For the even part, each of the summands is C(W C , Qb) so the sequence b n := ^ X]je£ C(Wi, PgJ converges and 
the limit is C(Wc,(5b) • Let a n := ^ X^jgO CC^iPs.) b e the ° ( 1 ( 1 P art i n d 1 29b . A basic fact in analysis states 



- V C(Wi,P Si ) + - V C(Wi,Ps 4 ) 

rj ^ — * 71 ^ — * 



7? 



(129) 
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that liminf n ^ 00 (a n + b n ) = lim inf n _s>oo a n + li m n->oo b n if b n has a limit. Hence, the above liminf is 



CM'less = r lim inf 

2 n— *oo 



— lim inf 

2 n— >oo 



ieo„ 



+ \c(W c ,Q b ) 



- \O n n j\ C(w a , Q a ) + - \O n n j c | c(w- b , Q a 

n n 



+ ^C(W c ,Q h ). 



(130) 



(131) 



It can be verified that liminf^oo ^\O n fl J"| = | and lim sup^^ ^|On n J7"| = |. Thus, the liminf in ( 11311 ) is 
liminf a n = .min„ {^C(W a , Q a ) + (1 - e ) C(W b , Q a )} (132) 

{C(# a , Q a ), C(V^ b , Q a )} + i max [c{W & , Q a ), C(# b , Q a )} . (133) 
which, by definition, is equal to $G(Q_{\mathrm{a}})$ in (29). This completes the verification of (28).

F. Proof of Corollary 5

Proof: Fix $P_{X^n,U^n|S^n}$. The key observation is to note that the joint distribution of $(S^n, U^n, X^n, Y^n)$ can be written as a convex combination of the distributions in (32):



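$$P_{S^n,U^n,X^n,Y^n}(s^n, u^n, x^n, y^n) = \Big[ \sum_{l=1}^{\infty} \beta_l P_{S_l^n}(s^n) \Big] P_{X^n,U^n|S^n}(x^n, u^n|s^n) \Big[ \sum_{k=1}^{\infty} \alpha_k W_k^n(y^n|x^n, s^n) \Big] \tag{134}$$

$$= \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \alpha_k \beta_l\, P_{S_l^n}(s^n) P_{X^n,U^n|S^n}(x^n, u^n|s^n) W_k^n(y^n|x^n, s^n) \tag{135}$$

$$= \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \alpha_k \beta_l\, P_{S_l^n,U_l^n,X_l^n,Y_{kl}^n}(s^n, u^n, x^n, y^n), \tag{136}$$

where (136) follows from the definition of $P_{S_l^n,U_l^n,X_l^n,Y_{kl}^n}$ in (32). By Tonelli's theorem, the marginals of $(U^n, Y^n)$ and $(U^n, S^n)$ are given respectively by

$$P_{U^n,Y^n}(u^n, y^n) = \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \alpha_k \beta_l\, P_{U_l^n,Y_{kl}^n}(u^n, y^n) \tag{137}$$

$$P_{U^n,S^n}(u^n, s^n) = \sum_{l=1}^{\infty} \beta_l\, P_{U_l^n,S_l^n}(u^n, s^n), \tag{138}$$

where in (138) we used the fact that $\sum_{k=1}^{\infty} \alpha_k = 1$. The number of terms in the sums in (137) and (138) is countable. By Lemma 3.3.2 in [7] (the spectral inf-mutual information rate of a mixed source is the infimum of the constituent spectral inf-mutual information rates of those processes with positive weights), we know that

$$\underline{I}(\mathbf{U};\mathbf{Y}) = \inf_{(k,l) \in \mathcal{K} \times \mathcal{L}} \underline{I}(\mathbf{U}_l;\mathbf{Y}_{kl}), \tag{139}$$

where recall that $\mathcal{K} = \{k \in \mathbb{N} : \alpha_k > 0\}$ and $\mathcal{L} = \{l \in \mathbb{N} : \beta_l > 0\}$. Analogously,

$$\overline{I}(\mathbf{U};\mathbf{S}) = \sup_{l \in \mathcal{L}} \overline{I}(\mathbf{U}_l;\mathbf{S}_l). \tag{140}$$

This completes the proof of (33).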

The achievability statement in (34) follows by considering the optimal $P_{X,U|S}$ (in the chain $U - (X, S) - Y$) and defining the i.i.d. random variables $(S_l^n, X_l^n, U_l^n, Y_{kl}^n) \sim \prod_{i=1}^n P_{S_l}(s_i) P_{X,U|S}(x_i, u_i|s_i) W_k(y_i|x_i, s_i)$. Khintchine's law of large numbers [7, Lemma 1.3.2] then asserts that for every $(k, l) \in \mathcal{K} \times \mathcal{L}$, we have $\underline{I}(\mathbf{U}_l;\mathbf{Y}_{kl}) = I(U_l;Y_{kl})$ and $\overline{I}(\mathbf{U}_l;\mathbf{S}_l) = I(U_l;S_l)$, completing the proof of the lower bound in (34). ■

G. Proof of Theorem 6

Proof: We will only sketch the proof for the direct part since it combines the proof of Theorem 1 and the main result in Iwata and Muramatsu [8] in a straightforward manner. In fact, this proof technique originated from Heegard and El Gamal [11]. Perform Wyner-Ziv coding for "source" $S^n$ at the state encoder with correlated "side information" $Y^n$. This generates a rate-limited description of $S^n$ at the decoder and the rate is approximately $\frac{1}{n} \log M_{\mathrm{d},n}$. Call this description $V^n \in \mathcal{V}^n$. The rate constraint is given in (36) after going through the same steps as in [8]. The description $V^n$ is then used as common side information of the state (at the encoder and decoder) for the usual Gel'fand-Pinsker coding problem, resulting in (35). This part is simply a conditional version of Theorem 1.
For the converse part, fix $\gamma > 0$ and note that

$$M_{\mathrm{d},n} \le \exp(n(R_{\mathrm{d}} + \gamma)) \tag{141}$$

for $n$ large enough by the second inequality in (14). Thus, if $V^n$ denotes an arbitrary random variable such that the cardinality of its support is no larger than $M_{\mathrm{d},n}$, we can check that [7, Theorem 5.4.1]

$$\mathbb{P}\left( \frac{1}{n} \log \frac{1}{P_{V^n}(V^n)} \ge \frac{1}{n} \log M_{\mathrm{d},n} + \gamma \right) \le \exp(-n\gamma). \tag{142}$$

We can further lower bound the left-hand side of (142) using (141), which yields

$$\mathbb{P}\left( \frac{1}{n} \log \frac{1}{P_{V^n}(V^n)} \ge R_{\mathrm{d}} + 2\gamma \right) \le \exp(-n\gamma). \tag{143}$$

Now consider,

$$R_{\mathrm{d}} \ge \overline{H}(\mathbf{V}) - 2\gamma \tag{144}$$

$$\ge \overline{H}(\mathbf{V}) - \underline{H}(\mathbf{V}|\mathbf{S}) - 2\gamma \tag{145}$$

$$\ge \overline{I}(\mathbf{V};\mathbf{S}) - 2\gamma \tag{146}$$

$$\ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}) - 2\gamma. \tag{147}$$

Inequality (144) follows from (143) and the definition of the spectral sup-entropy rate [7, Chapter 1]. Inequality (145) holds because by (141), $V^n$ is supported on finitely many points in $\mathcal{V}^n$ and so the spectral inf-conditional entropy rate $\underline{H}(\mathbf{V}|\mathbf{S})$ is non-negative [14, Lemma 2(a)]. Inequality (146) follows because the lim sup in probability is subadditive [6, Theorem 8(d)]. Finally, inequality (147) follows because the spectral inf-mutual information rate is non-negative [6, Theorem 8(c)].

The upper bound on the transmission rate in (35) follows by considering a conditional version of the Verdú-Han converse. As in (56), this yields the constraint

$$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) + 2\gamma, \tag{148}$$

where $\mathbf{U} = \{U^n\}_{n=1}^{\infty}$ denotes a random variable representing the uniform choice of a message in $[1 : M_n]$. Note that $\mathbf{V} = \{V^n\}_{n=1}^{\infty}$ is recoverable (available) at both encoder and decoder, hence the conditioning on $\mathbf{V}$ in (148). Also note that $(\mathbf{S}, \mathbf{V})$ is independent of $\mathbf{U}$ since $V^n$ is a function of $S^n$ and $S^n$ is independent of $U^n$. Hence $\overline{I}(\mathbf{U};\mathbf{S},\mathbf{V}) = 0$. Because $\overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}) \le \overline{I}(\mathbf{U};\mathbf{S},\mathbf{V})$ and the spectral sup-conditional mutual information rate is non-negative [7, Lemma 3.2.1],

$$\overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}) = 0. \tag{149}$$

In view of (148), we have

$$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}) + 2\gamma. \tag{150}$$

Since we have proved (147) and (150), we can now proceed in the same way we did for the unconditional case in the converse proof of Theorem 1. Refer to the steps (57) to (59). Finally, take $\gamma \to 0$ and this gives (35) and (36). ■

H. Proof of Corollary 7

Proof: We only prove the converse since achievability follows easily using i.i.d. random codes and Khintchine's law of large numbers [7, Lemma 1.3.2]. Essentially, all four spectral inf- and spectral sup-mutual information rates in (35) and (36) are mutual informations. For the converse, we fix $(\mathbf{U}, \mathbf{V}) - (\mathbf{X}, \mathbf{S}) - \mathbf{Y}$ and lower bound $R_{\mathrm{d}}$ in (36) as follows:



$$R_{\mathrm{d}} \ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}) \tag{151}$$

$$\ge \limsup_{n\to\infty} \frac{1}{n} I(V^n;S^n) - \liminf_{n\to\infty} \frac{1}{n} I(V^n;Y^n) \tag{152}$$

$$\ge \limsup_{n\to\infty} \frac{1}{n} \big[ I(V^n;S^n) - I(V^n;Y^n) \big] \tag{153}$$

$$\ge \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \big[ I(V^n, Y^{i-1}, S_{i+1}^n; S_i) - I(V^n, Y^{i-1}, S_{i+1}^n; Y_i) \big], \tag{154}$$

where (152) follows from the same reasoning as in (118) and (119) (using the fact that $|\mathcal{S}| < \infty$) and (154) follows the same steps as in the proof of Corollary 3. In the same way, the condition on $R$ in (35) can be further upper bounded using the Csiszár sum identity [2, Chapter 2] and memorylessness of $S^n$ as follows



$$R \le \liminf_{n\to\infty} \frac{1}{n} \Big[ \sum_{i=1}^n I(U^n;Y_i|V^n, Y^{i-1}, S_{i+1}^n) - I(U^n;S_i|V^n, Y^{i-1}, S_{i+1}^n) \Big]. \tag{155}$$

From this point on, the rest of the proof is standard. We let $\mathsf{V}_i := (V^n, Y^{i-1}, S_{i+1}^n)$ and $\mathsf{U}_i := (U^n, \mathsf{V}_i)$ and $\mathsf{Y}_i$, $\mathsf{X}_i$ and $\mathsf{S}_i$ as in the proof of Corollary 3. These identifications satisfy $(\mathsf{U}_i, \mathsf{V}_i) - (\mathsf{X}_i, \mathsf{S}_i) - \mathsf{Y}_i$ and the proof can be completed as per the usual steps.



I. Proof of Theorem 8

Proof: The basic idea is to specialize Lemma 9 to the stationary, memoryless case by setting $\gamma_1, \gamma_2, \delta_1, \delta_2$ to be appropriate sequences in $n$ such that the bound on the error probability in (50) does not exceed $\epsilon$. Instead of simply providing an achievable bound on the second-order coding rate, we aim to prove the lower bound on the $(n, \epsilon)$-capacity in (43). That is, we also bound the third-order term $\theta_n(\epsilon)$. All statements below hold for all $n \in \mathbb{N}$ but for some sequences, we also specify their asymptotic order to provide intuition.
To start, let the general distribution $P_{X^n,U^n|S^n}$ in Theorem 1 be given specifically by



$$P_{X^n,U^n|S^n}(x^n, u^n|s^n) := \prod_{i=1}^n P^*_{X,U|S}(x_i, u_i|s_i), \tag{156}$$

where $P^*_{X,U|S}(x, u|s) = P^*_{U|S}(u|s)\, \mathbf{1}\{x = g^*(u, s)\}$ and $P^*_{U|S}$ and $g^*$ optimize (1). Also, because of the stationarity and memorylessness of the state and channel, the information spectra degenerate to one-point masses at the corresponding mutual information quantities, i.e., $\underline{I}(\mathbf{U};\mathbf{Y}) = I(U;Y)$ and $\overline{I}(\mathbf{U};\mathbf{S}) = I(U;S)$. These equalities follow straightforwardly by Khintchine's law of large numbers [7, Lemma 1.3.2]. For the rest of this proof, let $T_{\mathrm{BE}} < 0.4748$ be the constant in the i.i.d. version of the Berry-Esseen theorem [37]. Define



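$$\epsilon_1 := \sqrt{\frac{V(U;Y)}{n}} \left( \frac{\delta_1}{1-\delta_1}\, Q^{-1}(\lambda^2\epsilon^2) + \frac{\nu_1}{(1-\delta_1)\sqrt{n}} \right) \tag{157}$$

$$\epsilon_2 := \sqrt{\frac{V(U;S)}{n}} \left( \frac{\delta_2}{1-\delta_2}\, Q^{-1}((1-2\lambda)\epsilon) + \frac{\nu_2}{(1-\delta_2)\sqrt{n}} \right) \tag{158}$$

for some positive sequences $\nu_1, \nu_2 \in \Theta(1)$ to be chosen later in (167) and (168). We will choose $\delta_1 \in \Theta(\frac{\log n}{\sqrt{n}})$ and $\delta_2 \in \Theta(\frac{\log\log n}{\sqrt{n}})$ (see (173) and (174)) so $\epsilon_1, \epsilon_2 \in O(\frac{\log n}{n})$. Now set

$$\gamma_1 := \sqrt{\frac{V(U;Y)}{n}}\, Q^{-1}(\lambda^2\epsilon^2) + \epsilon_1 \tag{159}$$

$$\gamma_2 := \sqrt{\frac{V(U;S)}{n}}\, Q^{-1}((1-2\lambda)\epsilon) + \epsilon_2 \tag{160}$$

to be sequences in $n$ (of order $\Theta(\frac{1}{\sqrt{n}})$) for some $\lambda \in [0, \frac{1}{2}]$. The sum $\gamma_1 + \gamma_2$ represents the backoff from capacity $C = I(U;Y) - I(U;S)$ at finite blocklength $n$. See (49). We use the identifications in (159) and (160) to upper bound $\pi_1$ and $\pi_2$ in Lemma 9. This will give us a bound on the error probability in (50). By substituting (159) into (47), we have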



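$$\pi_1 = \mathbb{P}\left( \frac{1}{n} \log \frac{P_{Y^n|U^n}(Y^n|U^n)}{P_{Y^n}(Y^n)} < I(U;Y) - (1-\delta_1)\left( \sqrt{\frac{V(U;Y)}{n}}\, Q^{-1}(\lambda^2\epsilon^2) + \epsilon_1 \right) \right) \tag{161}$$

$$= \mathbb{P}\left( \frac{1}{\sqrt{n V(U;Y)}} \sum_{i=1}^n \left( \log \frac{P_{Y|U}(Y_i|U_i)}{P_Y(Y_i)} - I(U;Y) \right) < -Q^{-1}(\lambda^2\epsilon^2) - \frac{\nu_1}{\sqrt{n}} \right) \tag{162}$$

$$\le Q\left( Q^{-1}(\lambda^2\epsilon^2) + \frac{\nu_1}{\sqrt{n}} \right) + \frac{T_{\mathrm{BE}}\, \kappa(U;Y)}{\sqrt{n}}, \tag{163}$$

where (162) follows from the i.i.d. nature of the random variables $(U^n, Y^n)$ and the definition of $\epsilon_1$ in (157). Inequality (163) follows from the Berry-Esseen theorem [37] (Lemma 10 below) where the skewness or the standardized third moment of $\log(P_{Y|U}(Y|U)/P_Y(Y))$ is defined as

$$\kappa(U;Y) := \mathbb{E}\left[ \left| \frac{1}{\sqrt{V(U;Y)}} \left( \log \frac{P_{Y|U}(Y|U)}{P_Y(Y)} - I(U;Y) \right) \right|^3 \right]. \tag{164}$$

We assumed that $V(U;Y)$ is non-zero and the third moment of the information density $\log(P_{Y|U}(Y|U)/P_Y(Y))$ is finite. Thus, $\kappa(U;Y) < \infty$. Now, we use Lemma 11 (stated at the end of this section) with the identifications $\zeta = \lambda^2\epsilon^2$ and $\delta = \frac{\nu_1}{\sqrt{n}}$ to approximate the $Q$ function in (163) as follows: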



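$$\pi_1 \le \lambda^2\epsilon^2 - \frac{\nu_1}{\sqrt{2\pi n}}\, e^{-[Q^{-1}(\lambda^2\epsilon^2) + \nu_1/\sqrt{n}]^2/2} + \frac{T_{\mathrm{BE}}\, \kappa(U;Y)}{\sqrt{n}}. \tag{165}$$

Note that $\lambda^2\epsilon^2 < \frac{1}{2}$ because $\lambda \in [0, \frac{1}{2}]$ and $\epsilon \in (0, 1)$. A similar calculation for $\pi_2$ yields

$$\pi_2 \le (1-2\lambda)\epsilon - \frac{\nu_2}{\sqrt{2\pi n}}\, e^{-[Q^{-1}((1-2\lambda)\epsilon) + \nu_2/\sqrt{n}]^2/2} + \frac{T_{\mathrm{BE}}\, \kappa(U;S)}{\sqrt{n}}, \tag{166}$$

where $\kappa(U;S)$ (which is also assumed to be finite) is defined analogously to $\kappa(U;Y)$ in (164). Note that (165) and (166) hold for all $n \in \mathbb{N}$. Now, we set $\nu_1$ and $\nu_2$ to be the respective solutions of the following two equations: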



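$$\nu_1\, e^{-[Q^{-1}(\lambda^2\epsilon^2) + \nu_1/\sqrt{n}]^2/2} = \sqrt{2\pi}\, \big( T_{\mathrm{BE}}\, \kappa(U;Y) + \tau \big) \tag{167}$$

$$\nu_2\, e^{-[Q^{-1}((1-2\lambda)\epsilon) + \nu_2/\sqrt{n}]^2/2} = \sqrt{2\pi}\, \big( T_{\mathrm{BE}}\, \kappa(U;S) + \tau \big). \tag{168}$$

Note that $\nu_1, \nu_2 \in \Theta(1)$. In (167) and (168), $\tau > 0$ is a constant chosen to be

$$\tau := \frac{2\lambda\epsilon}{\lambda\epsilon + 1}. \tag{169}$$

Substituting the choices of $\nu_1$ and $\nu_2$ into (165) and (166) respectively, we have

$$\pi_1 \le \lambda^2\epsilon^2 - \frac{\tau}{\sqrt{n}} \tag{170}$$

$$\pi_2 \le (1-2\lambda)\epsilon - \frac{\tau}{\sqrt{n}} \tag{171}$$

for all $n \in \mathbb{N}$. Plugging the bounds (170) and (171) into (50) yields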



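$$\mathbb{P}(\mathcal{E}) \le 2\left( \lambda^2\epsilon^2 - \frac{\tau}{\sqrt{n}} \right)^{1/2} + (1-2\lambda)\epsilon - \frac{\tau}{\sqrt{n}} + \exp(-\exp(n\delta_2\gamma_2)) + \exp(-n\delta_1\gamma_1) \tag{172}$$

for all $n \in \mathbb{N}$. Set $\delta_1$ and $\delta_2$ to be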

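$$\delta_1 := \frac{\log n}{\sqrt{n}} \left[ 2\sqrt{V(U;Y)} \left( Q^{-1}(\lambda^2\epsilon^2) + \frac{\nu_1}{\sqrt{n}} \right) + \frac{\log n}{\sqrt{n}} \right]^{-1} \tag{173}$$

$$\delta_2 := \frac{\log(\frac{1}{2}\log n)}{\sqrt{n}} \left[ \sqrt{V(U;S)} \left( Q^{-1}((1-2\lambda)\epsilon) + \frac{\nu_2}{\sqrt{n}} \right) + \frac{\log(\frac{1}{2}\log n)}{\sqrt{n}} \right]^{-1}. \tag{174}$$

Note that $\delta_1 = \Theta(\frac{\log n}{\sqrt{n}})$ and $\delta_2 = \Theta(\frac{\log\log n}{\sqrt{n}})$ because the terms in $[\,\cdot\,]$ are $\Theta(1)$. It is straightforward to verify that the design of $\delta_1, \delta_2, \gamma_1$ and $\gamma_2$ ensures that the final two terms in (172) are each equal to $\frac{1}{\sqrt{n}}$. Furthermore, since $t \mapsto (1-t)^{1/2}$ is concave, (172) can be bounded as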



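$$\mathbb{P}(\mathcal{E}) \le 2\lambda\epsilon \left( 1 - \frac{\tau}{2\lambda^2\epsilon^2\sqrt{n}} \right) + (1-2\lambda)\epsilon - \frac{\tau}{\sqrt{n}} + \frac{2}{\sqrt{n}} = \epsilon, \tag{175}$$

where the equality follows from the definition of $\tau$ in (169). Hence our code satisfies the constraint on the error probability in (37) and in fact does so nonasymptotically. Now, collecting (49), (159) and (160), we see that the rate of the code is

$$\frac{1}{n} \log M_n = I(U;Y) - I(U;S) - \gamma_1 - \gamma_2 \tag{176}$$

$$= I(U;Y) - I(U;S) + \sqrt{\frac{V(U;Y)}{n}}\, \Phi^{-1}(\lambda^2\epsilon^2) + \sqrt{\frac{V(U;S)}{n}}\, \Phi^{-1}((1-2\lambda)\epsilon) - \epsilon_1 - \epsilon_2. \tag{177}$$

Because $\epsilon_1, \epsilon_2 \in O(\frac{\log n}{n})$, the second-order coding rate of the sequence of $(n, M_n, \epsilon_n)$ codes satisfies

$$\liminf_{n\to\infty} \frac{1}{\sqrt{n}} (\log M_n - nC) \ge \sqrt{V(U;Y)}\, \Phi^{-1}(\lambda^2\epsilon^2) + \sqrt{V(U;S)}\, \Phi^{-1}((1-2\lambda)\epsilon). \tag{178}$$

Maximizing over the free parameter $\lambda \in [0, \frac{1}{2}]$ gives (41) and (42) as desired. ■

Lemma 10 (Berry-Esseen Theorem [37]). Let $X_1, \ldots, X_n$ be i.i.d. random variables with zero mean, unit variance and third moment $\kappa = \mathbb{E}[|X_1|^3]$. Then, for every $n \in \mathbb{N}$,

$$\sup_{z \in \mathbb{R}} \left| \mathbb{P}\left( \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i < z \right) - \Phi(z) \right| \le \frac{T_{\mathrm{BE}}\, \kappa}{\sqrt{n}}, \tag{179}$$

where the Berry-Esseen constant $T_{\mathrm{BE}}$ is no larger than 0.4748 [37].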
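Returning to the second-order bound (178) in the proof of Theorem 8, the tradeoff governed by $\lambda$ can be explored numerically. The following Python sketch is an illustration only: the dispersions `v_uy` and `v_us` and the target error `eps` are hypothetical inputs, the function names are ours, and a simple grid search over the open interval $(0, \frac{1}{2})$ stands in for the maximization over $\lambda$. It uses `statistics.NormalDist().inv_cdf` for $\Phi^{-1}$.

```python
import math
from statistics import NormalDist

PHI_INV = NormalDist().inv_cdf  # inverse of the standard Gaussian cdf

def second_order_coefficient(lam, eps, v_uy, v_us):
    """Right-hand side of (178): sqrt(V(U;Y)) Phi^{-1}(lam^2 eps^2) + sqrt(V(U;S)) Phi^{-1}((1-2 lam) eps)."""
    packing = math.sqrt(v_uy) * PHI_INV(lam ** 2 * eps ** 2)
    covering = math.sqrt(v_us) * PHI_INV((1.0 - 2.0 * lam) * eps)
    return packing + covering

def best_lambda(eps, v_uy, v_us, grid=2000):
    """Grid search for the lambda in (0, 1/2) that maximizes the coefficient."""
    candidates = [0.5 * i / grid for i in range(1, grid)]
    return max(candidates, key=lambda lam: second_order_coefficient(lam, eps, v_uy, v_us))

# Hypothetical dispersions (in nats^2) and a 1% error target.
lam_star = best_lambda(eps=0.01, v_uy=1.0, v_us=1.0)
print(lam_star, second_order_coefficient(lam_star, 0.01, 1.0, 1.0))
```

Small $\lambda$ spends most of the error budget on covering and penalizes packing through $\Phi^{-1}(\lambda^2\epsilon^2)$, while $\lambda$ close to $\frac{1}{2}$ does the opposite; the maximizer balances the two.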
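Lemma 11 (Approximation of Q function). Let $\zeta \in (0, \frac{1}{2})$ and $\delta > 0$. Then,

$$Q(Q^{-1}(\zeta) + \delta) \le \zeta - \frac{\delta}{\sqrt{2\pi}}\, e^{-(Q^{-1}(\zeta) + \delta)^2/2}. \tag{180}$$

Proof: Write $\zeta = Q(Q^{-1}(\zeta))$. Then we have

$$Q(Q^{-1}(\zeta) + \delta) - \zeta = -\int_{Q^{-1}(\zeta)}^{Q^{-1}(\zeta) + \delta} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du. \tag{181}$$

Since $\zeta < \frac{1}{2}$, $Q^{-1}(\zeta) > 0$. Furthermore the Gaussian probability density function $\mathcal{N}(u; 0, 1) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$ is monotonically decreasing for $u > 0$. Thus, the integral can be lower bounded by the width of the interval $\delta$ multiplied by $\mathcal{N}(u; 0, 1)$ evaluated at the upper limit $Q^{-1}(\zeta) + \delta$. ■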
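Lemma 11 can be checked numerically over a range of arguments. The sketch below is a hedged illustration rather than part of the proof: the helper names and the test grid are our own choices, and $Q$ and $Q^{-1}$ are computed from `statistics.NormalDist`. It returns the slack in (180), which should be non-negative for $\zeta \in (0, \frac{1}{2})$ and $\delta > 0$.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard Gaussian

def q_func(x):
    """Gaussian tail probability Q(x) = 1 - Phi(x)."""
    return 1.0 - _STD.cdf(x)

def q_inv(p):
    """Inverse of the Q function."""
    return _STD.inv_cdf(1.0 - p)

def lemma11_slack(zeta, delta):
    """Right-hand side of (180) minus its left-hand side; non-negative if the lemma holds."""
    t = q_inv(zeta) + delta
    bound = zeta - delta * math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    return bound - q_func(t)

for zeta in (0.01, 0.1, 0.3, 0.49):
    for delta in (0.01, 0.1, 1.0):
        print(zeta, delta, lemma11_slack(zeta, delta))
```

The slack shrinks as $\zeta$ approaches $\frac{1}{2}$ and $\delta$ becomes small, since the density is then nearly flat over the integration interval in (181).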
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